Two-Sample Smooth Tests for the Equality of Distributions

Wen-Xin Zhou, Chao Zheng and Zhen Zhang

Abstract 

This paper considers the problem of testing the equality of two unspecified distributions.
The classical omnibus tests such as the Kolmogorov-Smirnov and Cramér-von Mises tests are
known to suffer from low power against essentially all but location-scale alternatives.
We propose a new two-sample test that modifies Neyman’s smooth test and extend it to the
multivariate case based on the idea of projection pursuit. The asymptotic null property
of the test and its power against local alternatives are studied. The multiplier bootstrap
method is employed to compute the critical value of the multivariate test. We establish
validity of the bootstrap approximation in the case where the dimension is allowed to grow
with the sample size. Numerical studies show that the new testing procedures perform well
even for small sample sizes and are powerful in detecting local features or high-frequency
components.

Keywords: Neyman’s smooth test; Goodness-of-fit; Multiplier bootstrap; High-frequency 
alternations; Two-sample problem 

1 Introduction 

Let X and Y be two $\mathbb{R}^p$-valued random variables with continuous distribution functions F
and G, respectively, where $p \geq 1$ is a positive integer. Given data from each of the two
unspecified distributions F and G, we are interested in testing the null hypothesis of the
equality of distributions

$$H_0 : F = G \quad \text{versus} \quad H_1 : F \neq G. \tag{1.1}$$

This is the two-sample version of the conventional goodness-of-fit problem, which is one 
of the most fundamental hypothesis testing problems in statistics (Lehmann and Romano, 
2005). 
1.1 Univariate case: p = 1

Suppose we have two independent univariate random samples $\mathcal{X}_n = \{X_1, \ldots, X_n\}$ and
$\mathcal{Y}_m = \{Y_1, \ldots, Y_m\}$ from F and G, respectively. The empirical distribution functions (EDF)
are given by

$$F_n(x) = \frac{1}{n}\sum_{i=1}^n I(X_i \leq x) \quad \text{and} \quad G_m(y) = \frac{1}{m}\sum_{j=1}^m I(Y_j \leq y). \tag{1.2}$$

For testing the equality of two univariate distributions, conventional approaches in the
literature use a measure of discrepancy between $F_n$ and $G_m$ as a test statistic. Prototypical
examples include the Kolmogorov-Smirnov (KS) test
$T_{KS} = \sqrt{nm/(n+m)}\,\sup_{t \in \mathbb{R}} |F_n(t) - G_m(t)|$
and the Cramér-von Mises (CVM) family of statistics

$$\int_{-\infty}^{\infty} \{F_n(t) - G_m(t)\}^2\, w(H_{n+m}(t))\, dH_{n+m}(t),$$

where $H_{n+m}(t) := \{nF_n(t) + mG_m(t)\}/(n+m)$ denotes the pooled EDF and w is a non-negative
weight function. Taking $w \equiv 1$ yields the Cramér-von Mises statistic, and
$w(t) = \{t(1-t)\}^{-1}$ yields the Anderson-Darling statistic (Darling, 1957).
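As a rough illustration of the EDFs in (1.2) and the KS statistic built from them, the following Python sketch may help. The function names `edf` and `ks_two_sample` are hypothetical and chosen only for this example; the sketch relies on the fact that the supremum of the difference of two step functions is attained at a data point.

```python
from bisect import bisect_right
from math import sqrt


def edf(sample):
    """Return the empirical distribution function of `sample` as a callable."""
    xs = sorted(sample)
    n = len(xs)
    # bisect_right counts the observations that are <= t
    return lambda t: bisect_right(xs, t) / n


def ks_two_sample(x, y):
    """Two-sample KS statistic sqrt(nm/(n+m)) * sup_t |F_n(t) - G_m(t)|."""
    n, m = len(x), len(y)
    f_n, g_m = edf(x), edf(y)
    # F_n - G_m is piecewise constant and only changes at observed values,
    # so it suffices to evaluate the gap at the pooled data points.
    sup_gap = max(abs(f_n(t) - g_m(t)) for t in list(x) + list(y))
    return sqrt(n * m / (n + m)) * sup_gap
```

For two fully separated samples of size two, for instance, the EDF gap reaches one and the normalization equals one as well.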

The traditional omnibus tests have been widely used for testing the two-sample
goodness-of-fit hypothesis (1.1) owing to the simplicity with which they can be performed.
However, they suffer from low power in detecting densities containing high-frequency
components or local features such as bumps, and thus may have poor finite sample power
properties (Fan, 1996). It is known from empirical studies that the CVM test has poor power
against essentially all but location-scale alternatives (Eubank and LaRiccia, 1992). The
same issue arises in the KS test as well. To enhance power under local alternatives,
Neyman’s smooth method (Neyman, 1937), which was introduced earlier than the traditional
omnibus tests, tests only the first d-dimensional sub-problem when there is prior
information that most of the discrepancies fall within the first d orthogonal directions.
Essentially, Neyman’s smooth tests represent a compromise between omnibus and directional
tests. As evidenced by numerous empirical studies over the years, smooth tests have been
shown to be more powerful than traditional omnibus tests over a broad range of realistic
alternatives. See, for example, Eubank and LaRiccia (1992), Fan (1996), Janssen (2000),
Bera and Ghosh (2002) and Escanciano (2009).

A two-sample analogue of Neyman’s smooth test was recently proposed by Bera,
Ghosh and Xiao (2013) for testing the equality of F and G based on two independent
samples. The test statistic is asymptotically chi-square distributed and, as a special case
of Rao’s score test, it enjoys certain optimality properties. Specifically, Bera, Ghosh and
Xiao (2013) motivated the two-sample Neyman’s smooth test by considering the random
variable V = F(Y) with distribution and density functions given by

$$H(z) := G(F^{-1}(z)) \quad \text{and} \quad \rho(z) := g(F^{-1}(z))/f(F^{-1}(z)) \tag{1.3}$$

for $0 < z < 1$, respectively, where $F^{-1}$ is the quantile function of X, i.e.
$F^{-1}(z) = \inf\{x \in \mathbb{R} : F(x) \geq z\}$, and f and g denote the density functions of X and Y.
Assume that F and G are strictly increasing; then H is also increasing, $\rho(z) \geq 0$ for
$0 < z < 1$ and $\int_0^1 \rho(z)\,dz = 1$. Under the null hypothesis $H_0$, $\rho \equiv 1$ so that
$V =_d U(0,1)$. In other words, the null hypothesis $H_0$ in (1.1) is equivalent to

$$H_0 : \rho(z) = 1 \quad \text{for all } 0 < z < 1, \tag{1.4}$$


where $\rho$ is as in (1.3). Throughout, the function $\rho$ is referred to as the ratio density function.
Based on Neyman’s smooth test principle, we restrict attention to the following smooth
alternatives to the null of uniformity

$$\rho_\theta(z) = C_d(\theta)\exp\Big\{\sum_{k=1}^d \theta_k\psi_k(z)\Big\} \quad \text{for } \theta := (\theta_1,\ldots,\theta_d)^T \in \mathbb{R}^d \text{ and } 0 < z < 1, \tag{1.5}$$

which include a broad family of parametric alternatives, where $d = \dim(\theta)$ is some
positive integer and $\{C_d(\theta)\}^{-1} = \int_0^1 \exp\{\sum_{k=1}^d \theta_k\psi_k(z)\}\,dz$. Setting $\psi_0 \equiv 1$, the functions
$\psi_1,\ldots,\psi_d$ are chosen in such a way that $\{\psi_0,\psi_1,\ldots,\psi_d\}$ forms a set of orthonormal
functions, i.e.

$$\int_0^1 \psi_k(z)\psi_\ell(z)\,dz = \delta_{k\ell} = \begin{cases} 1, & \text{if } k = \ell, \\ 0, & \text{if } k \neq \ell. \end{cases} \tag{1.6}$$


The null hypothesis asserts $H_0 : \theta = 0$. Assuming that $m \leq n$ and the truncation parameter
d is fixed, the two-sample smooth test proposed by Bera, Ghosh and Xiao (2013) is defined
as $T_{BGX} = m\,\bar\psi^T\bar\psi$, where $\bar\psi = m^{-1}\sum_{j=1}^m \psi(\hat V_j)$, $\psi = (\psi_1,\ldots,\psi_d)^T$ and $\hat V_j = F_n(Y_j)$.

Under certain moment conditions and if the sample sizes (n, m) satisfy $m \log\log n = o(n)$
as $n, m \to \infty$, the test statistic $T_{BGX}$ converges in distribution to the $\chi^2$ distribution with
d degrees of freedom. Accounting for the error of estimating F, Bera, Ghosh and Xiao
(2013) further considered a generalized version of the smooth test that is asymptotically
$\chi^2(d)$ distributed and can be applied when n and m are of the same magnitude.

However, Bera, Ghosh and Xiao (2013) only focused on the fixed d scenario (i.e. d = k),
so that their two-sample smooth test is consistent in power against alternatives where
V = F(Y) does not have the same first k moments as the uniform distribution (Lehmann
and Romano, 2005). If there is a priori evidence that most of the energy is concentrated at
low frequencies, i.e. large $\theta_k$ are located at small k, it is reasonable to use Neyman’s smooth
test. Otherwise, Neyman’s test is less powerful when testing contiguous alternatives with
local characters (Fan, 1996). As Janssen (2000) pointed out, achieving reasonable power
over more than a few orthogonal directions is hopeless. Indeed, the larger the value of d, the
greater the number of orthogonal directions used to construct the test statistic. Therefore,
it is possible to obtain consistency against all distributions if the truncation parameter d
is allowed to increase with the sample size. Motivated by Chang, Zhou and Zhou (2014),
we regard $H_0 : \theta = 0$ as a global mean testing problem with dimension increasing with the
sample size. When d is large, Neyman’s smooth test, which is based on the $\ell_2$-norm of $\theta$,
may also suffer from low power under sparse alternatives. In part, this is because the
quadratic statistic accumulates high-dimensional estimation errors under $H_0$, resulting in
large critical values that can dominate the signals under sparse alternatives.

To overcome the foregoing drawbacks, we first note that the traditional omnibus tests
aim to capture the differences of two entire distributions as opposed to only assessing a
particular aspect of the distributions. By contrast, Neyman’s smooth principle reduces the
original nonparametric problem to a d-dimensional parametric one. Lying in the middle,
we are interested in enhancing the power in detecting two adjacent densities where one has
local features or contains high-frequency components, while maintaining the same capability
in detecting smooth alternative densities as the traditional tests. We expect to arrive
at a compromise between desired significance level and statistical power by allowing the
truncation parameter d to increase with the sample sizes. In Section 2, we introduce a new
test statistic by taking the maximum over d univariate statistics. The limiting null
distribution is derived under mild conditions, while d is allowed to grow with n and m. To
conduct inference, a novel intermediate approximation to the null distribution is proposed
to compute the critical value. In fact, when n and m are comparable, d can grow polynomially
in n (up to factors logarithmic in n), at a rate that depends on whether the trigonometric
series or the Legendre polynomial series is used to construct the test statistic; see
Remark 4.1.

1.2 Multivariate case: p ≥ 2

As a canonical problem in multivariate analysis, testing the equality of two multivariate
distributions based on two samples has been extensively studied in a literature that
dates back to Weiss (1960), under the conventional fixed p setting. Friedman and Rafsky
(1979) constructed a two-sample test based on the minimal spanning tree formed from
the interpoint distances, and their test statistic was shown to be asymptotically
distribution free under the null; Schilling (1986) and Henze (1988) proposed nearest
neighbor tests which are based on the number of times that the nearest neighbors come from
the same group; Rosenbaum (2005) proposed an exact distribution-free test based on a
matching of the observations into disjoint pairs to minimize the total distance within
pairs. Work in the context of nonparametric tests includes that of Hall and Tajvidi (2002),
Baringhaus and Franz (2004, 2010) and Biswas and Ghosh (2014), among others.

Most of the aforementioned methods are tailored for the case where the dimension
p is fixed. Driven by a broad range of contemporary statistical applications, analysis of
high-dimensional data is of significant current interest. In the high-dimensional setting,
the classical testing procedures may have poor power performance, as evidenced by the
numerical investigations in Biswas and Ghosh (2014). Several tests for the equality of two
distributions in high dimensions have been proposed. See, for example, Hall and Tajvidi
(2002) and Biswas and Ghosh (2014). However, limiting null distributions of the test
statistics introduced in Hall and Tajvidi (2002) and Biswas and Ghosh (2014) were derived
when the dimension p is fixed.

In the present paper, we propose a new test statistic that extends Neyman’s smooth
test principle to higher dimensions based on the idea of projection pursuit. To conduct
inference for the test, we employ the multiplier (wild) bootstrap method, which is similar in
spirit to that used in Hansen (1996) and Barrett and Donald (2003). We refer to Section 3
for details on methodologies. It can be shown (Propositions 7.2 and 7.3) that, under mild
conditions, the error in size of our multivariate smooth test decays polynomially in the sample
sizes (n, m). It is noteworthy that we allow the dimension p to grow as a function of (n, m),
a type of framework the existing methods do not rigorously address. More importantly, we
do not limit the dependency structure among the coordinates in X and Y, and no shape
constraints on the distribution curves are known a priori, which precludes a purely
parametric approach to the problem.

1.3 Organization of the paper 

The rest of the paper is organized as follows. In Section 2, we describe the two-sample
smooth testing procedure in the univariate case. An extension to the multivariate setting
based on projection pursuit is introduced in Section 3. Section 4 establishes theoretical
properties of the proposed smooth tests in both univariate and multivariate settings. Finite 
sample performance of the proposed tests is investigated in Section 5 through Monte Carlo 
experiments. The proofs of the main results are given in Section 7 and some additional 
technical arguments are contained in the Appendix. 

Notation. For a positive integer p, we write $[p] = \{1, 2, \ldots, p\}$ and denote by $|\cdot|_2$ and $|\cdot|_\infty$
the $\ell_2$- and $\ell_\infty$-norm in $\mathbb{R}^p$, respectively, i.e. $|x|_2 = (x_1^2 + \cdots + x_p^2)^{1/2}$ and $|x|_\infty = \max_{k \in [p]} |x_k|$
for $x = (x_1, \ldots, x_p)^T \in \mathbb{R}^p$. The unit sphere in $\mathbb{R}^p$ is denoted by $S^{p-1} = \{x \in \mathbb{R}^p : |x|_2 = 1\}$.
For two $\mathbb{R}^p$-valued random variables X and Y, we write $X =_d Y$ if they have the same
probability distribution and denote by $P_X$ the probability measure on $\mathbb{R}^p$ induced by X.
For two real numbers a and b, we use the notation $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$.
For two sequences of positive numbers $a_n$ and $b_n$, we write $a_n \asymp b_n$ if there exist constants
$c_1, c_2 > 0$ such that for all sufficiently large n, $c_1 \leq a_n/b_n \leq c_2$; we write $a_n = O(b_n)$ if there
is a constant $C > 0$ such that for all n large enough, $a_n \leq C b_n$; and we write $a_n \sim b_n$
and $a_n = o(b_n)$, respectively, if $\lim_{n\to\infty} a_n/b_n = 1$ and $\lim_{n\to\infty} a_n/b_n = 0$. For any
two functions $f, g : \mathbb{R} \to \mathbb{R}$, we denote by $f \circ g$ the composite function $f \circ g(x) = f\{g(x)\}$
for $x \in \mathbb{R}$.

For any probability measure Q on a measurable space $(S, \mathcal{S})$, let $\|\cdot\|_{Q,2}$ be the $L_2(Q)$-seminorm
defined by $\|f\|_{Q,2} = (Q|f|^2)^{1/2} = (\int |f|^2\,dQ)^{1/2}$ for $f \in L_2(Q)$. For a class $\mathcal{F}$
of measurable functions equipped with an envelope $F(s) = \sup_{f \in \mathcal{F}} |f(s)|$ for $s \in S$, let
$N(\mathcal{F}, L_2(Q), \varepsilon\|F\|_{Q,2})$ denote the $\varepsilon$-covering number of the class of functions $\mathcal{F}$ with respect
to the $L_2(Q)$-distance for $0 < \varepsilon \leq 1$. We say that the class $\mathcal{F}$ is Euclidean or VC-type (Nolan
and Pollard, 1987; van der Vaart and Wellner, 1996) if there are constants $A, v > 0$ such
that $\sup_Q N(\mathcal{F}, L_2(Q), \varepsilon\|F\|_{Q,2}) \leq (A/\varepsilon)^v$ for all $0 < \varepsilon \leq 1$, where the supremum ranges
over all finitely discrete probability measures on $(S, \mathcal{S})$. When $S = \mathbb{R}^p$ for $p \geq 1$, we
use $\mathcal{S}$ to denote the Borel $\sigma$-algebra unless otherwise stated.


2 Testing equality of two univariate distributions 


2.1 Oracle procedure 


Without loss of generality, we assume $n \geq m$ and recall that the null hypothesis $H_0 : F = G$
is equivalent to $H_0 : V =_d U(0,1)$ for V = F(Y) as in (1.4). Following Bera, Ghosh and
Xiao (2013), we consider the smooth alternatives lying in the family of densities (1.5),
which is a d-parameter exponential family, where $d = d_{n,m}$ is allowed to increase with n
and m in order to obtain power against a large array of alternatives. In particular, this
family is quadratic mean differentiable at $\theta = 0$ and therefore the score vector at $\theta = 0$ is
given by $(\frac{\partial}{\partial\theta_1}\log L_m(\theta), \ldots, \frac{\partial}{\partial\theta_d}\log L_m(\theta))^T|_{\theta=0}$ (Lehmann and Romano, 2005), where
$L_m(\theta) = \{C_d(\theta)\}^m \exp\{\sum_{k=1}^d \theta_k \sum_{j=1}^m \psi_k(V_j)\}$ is the likelihood function and $V_j = F(Y_j)$,
such that $\frac{\partial}{\partial\theta_k}\log L_m(\theta) = \sum_{j=1}^m [\psi_k(V_j) - E_\theta\{\psi_k(V_j)\}]$. As $\{\psi_0 \equiv 1, \psi_1, \ldots, \psi_d\}$ forms a set
of orthonormal functions, it is easy to see that if $\theta = 0$, $E_0\{\psi_k(V)\} = 0$ and $E_0\{\psi_k(V)^2\} = 1$
for every $k \in [d]$.

To provide a more omnibus test against a broader range of alternatives, we allow a large
truncation parameter d and, for the reduced null hypothesis $H_0 : \theta = 0$, it is instructive to
consider the following oracle statistic

$$\Psi(d) = \sqrt{\frac{nm}{n+m}}\,\max_{1 \leq k \leq d}\Big|\frac{1}{m}\sum_{j=1}^m \psi_k(V_j)\Big|, \tag{2.1}$$


which can be regarded as a smoothed version of the KS statistic. Throughout, the number
of orthogonal directions $d = \dim(\theta)$ is chosen such that $d \leq n \wedge m$. Intuitively, this extreme
value statistic is appealing when most of the energy (non-zero $\theta_k$) is concentrated on a
few dimensions but with unknown locations, meaning that both low- and high-frequency
alternations are possible. It is now a common belief (Cai, Liu and Xia, 2014) that maximum-
type statistics are powerful against sparse alternatives, which in the current context is the
case where the two densities only differ in a small number of orthogonal directions (not
necessarily in the first few). To see this, consider a contiguous alternative where there
exists some $\ell^*$ such that $\theta_{\ell^*} \neq 0$ with $|\theta_{\ell^*}|$ sufficiently small and $\theta_\ell = 0$ for all other $\ell$. Then
informally we have $\{C_d(\theta)\}^{-1} = \int_0^1 \exp\{\theta_{\ell^*}\psi_{\ell^*}(z)\}\,dz \approx \int_0^1 \{1 + \theta_{\ell^*}\psi_{\ell^*}(z)\}\,dz = 1$ and

$$E_\theta\{\psi_k(V)\} = C_d(\theta)\int_0^1 \psi_k(z)\exp\{\theta_{\ell^*}\psi_{\ell^*}(z)\}\,dz \approx \int_0^1 \psi_k(z)\{1 + \theta_{\ell^*}\psi_{\ell^*}(z)\}\,dz = \theta_{\ell^*}\delta_{k\ell^*}.$$

Under a sparse alternative where only a few components of $\theta = (\theta_1, \ldots, \theta_d)^T$ are non-zero,
the power typically depends on the magnitudes of the signals (non-zero coordinates of $\theta$)
and the number of the signals.

2.2 Data-driven procedure 

For the oracle statistic $\Psi(d)$ in (2.1), the random variables $V_j = F(Y_j)$ are not directly
observed as the distribution function F is unspecified. Indeed, this is the major difference
of the two-sample problem from the classical (one-sample) goodness-of-fit problem. We
therefore consider the following data-driven procedure. In the first stage, an estimate $\hat V_j$ of
$V_j$ is obtained by using the empirical distribution function $F_n$:

$$\hat V_j = F_n(Y_j) = \frac{1}{n}\sum_{i=1}^n I(X_i \leq Y_j). \tag{2.2}$$

Then the data-driven version of $\Psi(d)$ in (2.1) is defined by

$$\hat\Psi = \hat\Psi(d) = \sqrt{\frac{nm}{n+m}}\,\max_{1 \leq k \leq d}|\hat\psi_k|, \tag{2.3}$$

where $\hat\psi_k = m^{-1}\sum_{j=1}^m \psi_k(\hat V_j)$. In the case $m > n$, we may use $G_m$ instead of $F_n$, leading
to an alternative test statistic $\tilde\Psi(d) = \sqrt{nm/(n+m)}\,\max_{1 \leq k \leq d}|n^{-1}\sum_{i=1}^n \psi_k(G_m(X_i))|$.
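The two-stage construction in (2.2) and (2.3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `cosine_basis` and `smooth_statistic` are hypothetical, and the cosine basis $\psi_k(z) = \sqrt{2}\cos(\pi k z)$ of Section 2.3 is assumed as the default.

```python
from bisect import bisect_right
from math import cos, pi, sqrt


def cosine_basis(k, z):
    """Assumed basis: psi_k(z) = sqrt(2) * cos(pi * k * z) on [0, 1]."""
    return sqrt(2.0) * cos(pi * k * z)


def smooth_statistic(x, y, d, basis=cosine_basis):
    """Data-driven statistic: sqrt(nm/(n+m)) * max_k |mean_j psi_k(F_n(Y_j))|."""
    n, m = len(x), len(y)
    xs = sorted(x)
    # Stage 1: V_hat_j = F_n(Y_j), the fraction of X's that are <= Y_j
    v_hat = [bisect_right(xs, yj) / n for yj in y]
    # Stage 2: averaged basis functions psi_hat_k for k = 1, ..., d
    psi_hat = [sum(basis(k, v) for v in v_hat) / m for k in range(1, d + 1)]
    return sqrt(n * m / (n + m)) * max(abs(p) for p in psi_hat)
```

Passing a different `basis` callable (for example, normalized Legendre polynomials) changes only the second stage.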

Typically, large values of $\hat\Psi$ lead to a rejection of the null $H_0 : \theta = 0$ and hence of
$H_0 : F = G$ or, equivalently, $H_0$ in (1.4). For conducting inference, we need to compute the
critical value so that the corresponding test has approximately size $\alpha$. A natural approach
is to derive the limiting distribution of the test statistic $\hat\Psi(d)$ under the null. Under certain
smoothness conditions on $\psi_k$, it can be shown that for every $k \in [d]$,

$$\hat\psi_k = \frac{1}{m}\sum_{j=1}^m \psi_k(\hat V_j) \approx \frac{1}{m}\sum_{j=1}^m \psi_k(V_j) - \frac{1}{n}\sum_{i=1}^n \psi_k(U_i),$$

where $U_i = G(X_i)$. See, for example, (7.7) and (7.10) in the proof of Proposition 7.1. Under
$H_0$ and when $d \geq 1$ is fixed, a direct application of the multivariate central limit theorem
is that as $n, m \to \infty$,

$$\sqrt{\frac{nm}{n+m}}\,(\hat\psi_1, \ldots, \hat\psi_d)^T \xrightarrow{d} G =_d N(0, I_d), \tag{2.4}$$

where $I_d$ is the d-dimensional identity matrix. This implies by the continuous mapping
theorem that $\hat\Psi(d) \xrightarrow{d} |G|_\infty$ when d is fixed.

For every $0 < \alpha < 1$, denote by $z_\alpha$ the $(1-\alpha)$-quantile of the standard normal
distribution, i.e. $z_\alpha = \Phi^{-1}(1 - \alpha)$, where $\Phi$ is the standard normal distribution function.
Then, the $(1-\alpha)$-quantile of $|G|_\infty$ can be expressed as
$c_\alpha(d) = z_{1/2 - (1-\alpha)^{1/d}/2} = \Phi^{-1}(1/2 + (1-\alpha)^{1/d}/2)$. The corresponding asymptotic $\alpha$-level
Smooth test is thus defined as

$$\Phi^S_\alpha(d) = I\{\hat\Psi(d) > c_\alpha(d)\}. \tag{2.5}$$

The null hypothesis $H_0$ is rejected if and only if $\Phi^S_\alpha(d) = 1$.
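The closed-form critical value follows from the independence of the d coordinates of G: the maximum of d independent absolute standard normals stays below c with probability $(2\Phi(c) - 1)^d$, and setting this equal to $1 - \alpha$ gives the stated quantile. A small Python sketch, with the hypothetical helper names `critical_value` and `smooth_test`, evaluates it using only the standard library.

```python
from statistics import NormalDist


def critical_value(alpha, d):
    """(1 - alpha)-quantile of max_k |G_k| for G ~ N(0, I_d)."""
    # Solve (2 * Phi(c) - 1) ** d = 1 - alpha for c.
    return NormalDist().inv_cdf(0.5 + (1.0 - alpha) ** (1.0 / d) / 2.0)


def smooth_test(statistic, alpha, d):
    """Return True when the null is rejected, i.e. statistic > c_alpha(d)."""
    return statistic > critical_value(alpha, d)
```

For d = 1 this reduces to the familiar two-sided normal quantile, and the critical value grows slowly as d increases.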

To construct a test that has better power for alternative densities with large energy at
high frequencies, we allow the truncation parameter $d = \dim(\theta)$ to increase with the sample
sizes n and m. This setup was previously considered by Fan (1996) in the context of the
Gaussian white noise model, where it was argued that if there is a priori evidence that large
$\theta_k$’s are located at small k, then it is reasonable to select a relatively small d; otherwise the
resulting test may suffer from low power in detecting densities containing high-frequency
components. By letting d increase with the sample sizes, we allow for different
asymptotics than Neyman’s fixed d large sample scenario. This type of asymptotics aims
to illustrate how the truncation parameter d may affect the quality of the test statistic,
and to depict a more accurate picture of the behavior for fixed samples. In the present
two-sample context, it will be shown (Proposition 7.1) that the distribution of $\hat\Psi(d)$ can
still be consistently estimated by that of $|G|_\infty$ with the truncation parameter d increasing
polynomially in n and m, where G is a d-dimensional centered Gaussian random vector
with covariance matrix $I_d$. Consequently, the asymptotic size of the smooth test $\Phi^S_\alpha(d)$ in
(2.5) coincides with the nominal size $\alpha$ (Theorem 4.1).

2.3 Choice of the function basis 

In this paper, we shall focus on the following two sets of orthonormal functions with respect to
the Lebesgue measure on [0,1], which are the most commonly used bases for constructing
smooth-type goodness-of-fit tests.

(i) (Legendre Polynomial (LP) series). Neyman’s original proposal (Neyman, 1937) was
to use orthonormal polynomials, now known as the normalized Legendre polynomials.
Specifically, $\psi_k$ is chosen to be a polynomial of degree k which is orthogonal to all
the ones before it and is normalized to size 1 as in (1.6). Setting $\psi_0 \equiv 1$, the next
four $\psi_k$’s are explicitly given by: $\psi_1(z) = \sqrt{3}(2z - 1)$, $\psi_2(z) = \sqrt{5}(6z^2 - 6z + 1)$,
$\psi_3(z) = \sqrt{7}(20z^3 - 30z^2 + 12z - 1)$ and $\psi_4(z) = 3(70z^4 - 140z^3 + 90z^2 - 20z + 1)$. In
general, the normalized Legendre polynomial of order k can be written as

$$\psi_k(z) = \frac{\sqrt{2k+1}}{k!}\frac{d^k}{dz^k}(z^2 - z)^k, \quad 0 \leq z \leq 1, \ k = 1, 2, \ldots. \tag{2.6}$$

See, for example, Lemma 1 in Bera, Ghosh and Xiao (2013).

(ii) (Trigonometric series). Another widely used basis of orthonormal functions is a
trigonometric series given by

$$\psi_k(z) = \sqrt{2}\cos(\pi k z), \quad 0 \leq z \leq 1, \ k = 1, 2, \ldots. \tag{2.7}$$

This particular choice arises in the construction of the weighted quadratic type test
statistics, including the Cramér-von Mises and the Anderson-Darling test statistics
as prototypical examples (Eubank and LaRiccia, 1992; Lehmann and Romano, 2005).
Alternatively, one could use the Fourier series, which is also a popular trigonometric
series given by $\{\cos(2\pi k z), \sin(2\pi k z) : k = 1, \ldots, d/2\}$ for d even.
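The orthonormality requirement (1.6) for both bases can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the Legendre polynomials are evaluated through the standard three-term recurrence on [-1, 1] after the change of variable x = 2z - 1, and the integral is approximated with a simple midpoint rule.

```python
from math import cos, pi, sqrt


def legendre_basis(k, z):
    """Normalized shifted Legendre polynomial psi_k on [0, 1]."""
    if k == 0:
        return 1.0
    x = 2.0 * z - 1.0
    p_prev, p = 1.0, x  # P_0 and P_1 on [-1, 1]
    for j in range(1, k):
        # Bonnet recurrence: (j + 1) P_{j+1} = (2j + 1) x P_j - j P_{j-1}
        p_prev, p = p, ((2 * j + 1) * x * p - j * p_prev) / (j + 1)
    return sqrt(2 * k + 1) * p


def cosine_basis(k, z):
    """Cosine basis psi_k(z) = sqrt(2) cos(pi k z), with psi_0 = 1."""
    return 1.0 if k == 0 else sqrt(2.0) * cos(pi * k * z)


def inner_product(basis, k, l, grid=20000):
    """Midpoint-rule approximation of the integral of psi_k * psi_l over [0, 1]."""
    h = 1.0 / grid
    return h * sum(basis(k, (i + 0.5) * h) * basis(l, (i + 0.5) * h)
                   for i in range(grid))
```

Evaluating `inner_product` for equal and unequal indices should return values close to one and zero, respectively, for either basis.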

Other commonly used compactly supported orthonormal series include spline series (de
Boor, 1978), Cohen-Daubechies-Vial wavelet series (Mallat, 1999) and local polynomial
partition series (Cattaneo and Farrell, 2013), among others. As the two-sample test statistics
constructed in this paper use orthonormal functions that are at least twice continuously
differentiable on [0,1], we restrict attention to the Legendre polynomial series (2.6) and the
trigonometric series (2.7) only. Indeed, the idea developed here can be directly applied to
construct (one-sample) goodness-of-fit tests in one and higher dimensions without imposing
smoothness conditions on the series.

3 Testing equality of two multivariate distributions 

As evidenced by both theoretical (Section 4.1) and numerical (Section 5) studies,
Neyman’s smooth test principle leads to convenient and powerful tests for univariate data.
However, the presence of multivariate joint distributions makes it difficult, or even
unrealistic, to consider a direct multivariate extension of the smooth alternatives given in
(1.5). In the case of complete independence where all the components of X and Y are
independent, the problem of testing equality of two multivariate distributions is equivalent
to that of testing equality of many marginal distributions. Neyman’s smooth principle can
therefore be applied to each of the p marginals.

In this section, we do not impose any assumption that limits the dependence among the
coordinates in X and Y and note here that the null hypothesis $H_0 : F = G$ is equivalent
to $H_0 : u^T X =_d u^T Y$, $\forall u \in S^{p-1}$. This observation and the idea of projection pursuit now
allow us to apply Neyman’s smooth test principle, yielding a family of univariate smooth tests
indexed by $u \in S^{p-1}$, based on which we shall construct our test that incorporates the
correlations among all the one-dimensional projections.


3.1 Test statistics 


Assume that two independent random samples, $X_1, \ldots, X_n$ from the distribution F and
$Y_1, \ldots, Y_m$ from the distribution G, are observed, where the two sample sizes are comparable
and $m \leq n$. Along every direction $u \in S^{p-1}$, let $F^u$ and $G^u$ be the distribution functions
of the one-dimensional projections $u^T X$ and $u^T Y$, respectively, and define the corresponding
empirical distribution functions by

$$F_n^u(x) = \frac{1}{n}\sum_{i=1}^n I(u^T X_i \leq x) \quad \text{and} \quad G_m^u(y) = \frac{1}{m}\sum_{j=1}^m I(u^T Y_j \leq y). \tag{3.1}$$

As a natural multivariate extension of the KS test, we consider the following test statistic

$$\hat\Psi_{MKS} = \sqrt{\frac{nm}{n+m}}\,\sup_{(u,t) \in S^{p-1} \times \mathbb{R}} |F_n^u(t) - G_m^u(t)|,$$

which coincides with the KS test when p = 1. Baringhaus and Franz (2004) proposed a
multivariate extension of the Cramér-von Mises test which is of the form

$$\hat\Psi_{BF} = \frac{nm}{n+m}\int_{S^{p-1}}\int_{-\infty}^{\infty} \{F_n^u(t) - G_m^u(t)\}^2\,dt\,\vartheta(du),$$

where $\vartheta$ denotes the Lebesgue measure on $S^{p-1}$. Despite their popularity in practice, the
classical omnibus distribution-based testing procedures suffer from low power in detecting
fine features such as sharp and short aberrants as well as global features such as
high-frequency alternations (Fan, 1996). It is now well known that the foregoing drawbacks
can be repaired via smoothing-based test statistics. This motivates the following
multivariate smooth test statistic.

As in Section 2, let $\{\psi_0 \equiv 1, \psi_1, \ldots, \psi_d\}$ ($d \geq 1$) be a set of orthonormal functions
and put $\psi = (\psi_1, \ldots, \psi_d)^T : [0,1] \to \mathbb{R}^d$. Using the union-intersection principle, the
two-sample problem of testing $H_0 : F = G$ versus $H_1 : F \neq G$ can be expressed as a collection of
univariate testing problems, by noting that $H_0$ and $H_1$ are equivalent to $\bigcap_{u \in S^{p-1}} H_{u,0}$ and
$\bigcup_{u \in S^{p-1}} H_{u,1}$, respectively, where

$$H_{u,0} : u^T X =_d u^T Y, \qquad H_{u,1} : u^T X \neq_d u^T Y.$$

For every marginal null hypothesis $H_{u,0}$, we consider a smooth-type test statistic in the
same spirit as in Section 2.1:

$$\Psi_u(d) = \Big|\frac{1}{m}\sum_{j=1}^m \psi(V_j^u)\Big|_\infty \quad \text{with } V_j^u = F^u(u^T Y_j). \tag{3.2}$$
For diagnostic purposes, it is interesting to find the best separating direction, i.e.
$u_{\max} := \arg\max_{u \in S^{p-1}} \Psi_u(d)$, along which the two distributions differ most. For the purpose of
conducting inference, which is the main objective of this paper, we simply plug $u_{\max}$ into (3.2)
to get the oracle test statistic $\Psi_{\max}(d) := \Psi_{u_{\max}}(d) = \sup_{u \in S^{p-1}} \Psi_u(d)$, though it is practically
infeasible as the distribution function F is unspecified. The most natural and convenient
approach is to replace $V_j^u$ in (3.2) with $\hat V_j^u = F_n^u(u^T Y_j)$, leading to the following extreme
value statistic for testing $H_0 : F = G$,

$$\hat\Psi_{\max} = \hat\Psi_{\max}(d) = \sqrt{\frac{nm}{n+m}}\,\sup_{u \in S^{p-1}} \hat\Psi_u(d), \tag{3.3}$$

where $\hat\Psi_u(d) = |m^{-1}\sum_{j=1}^m \psi(\hat V_j^u)|_\infty = \max_{1 \leq k \leq d}|\hat\psi_{u,k}|$ with $\hat\psi_{u,k} = m^{-1}\sum_{j=1}^m \psi_k(\hat V_j^u)$.
Rejection of the null is thus for large values of $\hat\Psi_{\max}$, say $\hat\Psi_{\max} > c_\alpha(d)$, where $c_\alpha(d)$ is a
critical value to be determined so that the resulting test has the pre-specified significance
level $\alpha \in (0,1)$ asymptotically.


3.2 Critical values 


Due to the highly complex dependence structure among $\{\hat\psi_{u,k}\}_{(u,k) \in S^{p-1} \times [d]}$, the limiting
(null) distribution of $\hat\Psi_{\max}(d)$ may not exist. In fact, $\hat\Psi_{\max}(d)$ can be regarded as the
supremum of an empirical process indexed by the class

$$\mathcal{F}_{n,m} := \Big\{x \mapsto \psi_k\Big(\frac{1}{n}\sum_{i=1}^n I\{u^T(x - X_i) \geq 0\}\Big) : (u,k) \in S^{p-1} \times [d]\Big\}$$

of functions $\mathbb{R}^p \to \mathbb{R}$; that is, $\hat\Psi_{\max}(d) = \sqrt{nm/(n+m)}\,\sup_{f \in \mathcal{F}_{n,m}} |m^{-1}\sum_{j=1}^m f(Y_j)|$. Since we
allow the dimension p to grow with the sample sizes, the “complexity” of $\mathcal{F}_{n,m}$ increases with
n, m and the class is thus non-Donsker. Therefore, the extreme value statistic $\hat\Psi_{\max}(d)$, even after
proper normalization, may not be weakly convergent as $n, m \to \infty$. Tailored for such
non-Donsker classes of functions that change with the sample size, Chernozhukov, Chetverikov
and Kato (2013) developed Gaussian approximations for certain maximum-type statistics
under weak regularity conditions. This motivates us to take a different route by using
the multiplier (wild) bootstrap method to compute the critical value $c_\alpha(d)$ for the statistic
$\hat\Psi_{\max}(d)$ so that the resulting test has approximately size $\alpha \in (0,1)$.

Let $\{Z_1, \ldots, Z_N\} = \{Y_1, \ldots, Y_m, X_1, \ldots, X_n\}$ denote the pooled sample with a total
sample size N = n + m. For every $(u,k) \in S^{p-1} \times [d]$, we shall prove in Section 7.4.2 that

$$\sqrt{\frac{nm}{n+m}}\,\hat\psi_{u,k} \approx \frac{1}{\sqrt{N}}\sum_{j=1}^N w_j\,\psi_k \circ F^u(u^T Z_j),$$

where $w_j = \sqrt{n/m}$ for $j \in [m]$ and $w_j = -\sqrt{m/n}$ for $j \in m + [n]$. This implies that, under
certain regularity conditions, $\hat\Psi_{\max}(d) \approx \sup_{f \in \mathcal{F}_d^p} |N^{-1/2}\sum_{j=1}^N w_j f(Z_j)|$, where
$\mathcal{F}_d^p = \{x \mapsto \psi_k \circ F^u(u^T x) : (u,k) \in S^{p-1} \times [d]\}$. Furthermore, we shall prove in Proposition 7.2 that,
under the null hypothesis $H_0 : F = G$,

$$\hat\Psi_{\max}(d) \approx_d \|G\|_{\mathcal{F}_d^p} := \sup_{f \in \mathcal{F}_d^p} |Gf|, \tag{3.4}$$

where G is a centered Gaussian process indexed by $\mathcal{F}_d^p$ with covariance function

$$E\{(Gf_{u,k})(Gf_{v,\ell})\} = P_X(f_{u,k}f_{v,\ell}) = E\{f_{u,k}(X)f_{v,\ell}(X)\} = \int_{\mathbb{R}^p} \psi_k(F^u(u^T x))\,\psi_\ell(F^v(v^T x))\,dF(x), \tag{3.5}$$

with $U^u = F^u(u^T X) =_d \mathrm{Unif}(0,1)$ for $u \in S^{p-1}$. In particular, $E\{(Gf_{u,k})(Gf_{u,\ell})\} = \delta_{k\ell}$.
The distribution of $\|G\|_{\mathcal{F}_d^p}$, however, is unspecified because its covariance function
is unknown. Therefore, in practice we need to replace it with a suitable estimator, and
then simulate the Gaussian process G to compute the critical value $c_\alpha(d)$ numerically, as
described below.

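Multiplier bootstrap.

(i) Independent of the observed data $\{X_i\}_{i=1}^n$ and $\{Y_j\}_{j=1}^m$, generate i.i.d. standard
normal random variables $e_1, \ldots, e_n$. Then construct the Multiplier Bootstrap statistic

$$\hat\Psi^{MB}_{\max} = \hat\Psi^{MB}_{\max}(d) = \sup_{(u,k) \in S^{p-1} \times [d]} \Big|\frac{1}{\sqrt{n}}\sum_{i=1}^n e_i\,\psi_k(\hat U_i^u)\Big|, \tag{3.6}$$

where $\hat U_i^u = F_n^u(u^T X_i)$ for $F_n^u(\cdot)$ as in (3.1).

(ii) Calculate the data-driven critical value $c^{MB}_\alpha(d)$, which is defined as the conditional
$(1-\alpha)$-quantile of $\hat\Psi^{MB}_{\max}$ given $\{X_i\}_{i=1}^n$; that is,

$$c^{MB}_\alpha(d) = \inf\{t \in \mathbb{R} : P_e(\hat\Psi^{MB}_{\max} > t) \leq \alpha\}, \tag{3.7}$$

where $P_e$ denotes the probability measure induced by the normal random variables
$\{e_i\}_{i=1}^n$ conditional on $\{X_i\}_{i=1}^n$.

For every $t \geq 0$, $P_e(\hat\Psi^{MB}_{\max} \leq t)$ is a random variable depending on $\{X_i\}_{i=1}^n$ and so
is $c^{MB}_\alpha(d)$, which can be computed with arbitrary accuracy via Monte Carlo simulations.
Consequently, we propose the following Multivariate Smooth test

$$\Phi^{MS}_\alpha(d) = I\{\hat\Psi_{\max}(d) > c^{MB}_\alpha(d)\}. \tag{3.8}$$

The null hypothesis $H_0 : F = G$ is rejected if and only if $\Phi^{MS}_\alpha(d) = 1$.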
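The procedure can be prototyped as follows. This is a rough sketch under stated assumptions, not the authors' algorithm: the supremum over the sphere is replaced by a maximum over a finite set of random directions (a simplification introduced here), the cosine basis is assumed, the multipliers are attached to the X-sample ranks only, and all function names and default arguments are hypothetical.

```python
import random
from bisect import bisect_right
from math import ceil, cos, pi, sqrt


def cosine_basis(k, z):
    """Assumed basis: psi_k(z) = sqrt(2) * cos(pi * k * z)."""
    return sqrt(2.0) * cos(pi * k * z)


def random_directions(p, count, rng):
    """Draw `count` directions uniformly on the unit sphere in R^p."""
    dirs = []
    for _ in range(count):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = sqrt(sum(c * c for c in g))
        dirs.append([c / norm for c in g])
    return dirs


def project(sample, u):
    """One-dimensional projections u^T z for every row z of `sample`."""
    return [sum(ui * zi for ui, zi in zip(u, row)) for row in sample]


def multivariate_smooth_test(x, y, d, alpha=0.05, n_dirs=50, n_boot=500, seed=0):
    """Return (statistic, bootstrap critical value, reject?)."""
    rng = random.Random(seed)
    n, m = len(x), len(y)
    stat = 0.0
    features = []  # one length-n vector (psi_k(U_hat_i^u))_i per pair (u, k)
    for u in random_directions(len(x[0]), n_dirs, rng):
        px, py = project(x, u), project(y, u)
        sx = sorted(px)
        v_hat = [bisect_right(sx, t) / n for t in py]  # F_n^u(u^T Y_j)
        u_hat = [bisect_right(sx, t) / n for t in px]  # F_n^u(u^T X_i)
        for k in range(1, d + 1):
            psi_hat = sum(cosine_basis(k, v) for v in v_hat) / m
            stat = max(stat, abs(psi_hat))
            features.append([cosine_basis(k, w) for w in u_hat])
    stat *= sqrt(n * m / (n + m))
    # Multiplier bootstrap: redraw Gaussian multipliers, keep the data fixed.
    boot = []
    for _ in range(n_boot):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        boot.append(max(abs(sum(ei * fi for ei, fi in zip(e, f)))
                        for f in features) / sqrt(n))
    boot.sort()
    # Smallest bootstrap value t whose empirical exceedance rate is <= alpha.
    j = min(n_boot, max(1, ceil((1.0 - alpha) * n_boot - 1e-9)))
    crit = boot[j - 1]
    return stat, crit, stat > crit
```

Because only a finite set of directions is searched, the computed statistic is a lower bound on the supremum in (3.3); increasing `n_dirs` tightens the approximation at a proportional computational cost.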
4 Theoretical properties

Assume that we are given independent samples from the two (univariate and multivariate) 
distributions. As different sample sizes are allowed, for technical reasons we need to impose 
assumptions about the way in which samples sizes grow. The following gives the basic 
assumptions on the sampling process. 

Assumption 4.1. 

(i) $\{X_1, \ldots, X_n\}$ and $\{Y_1, \ldots, Y_m\}$ are two independent random samples from X and Y,
with absolutely continuous distribution functions F and G, respectively;

(ii) The sample sizes, n and m, are comparable in the sense that $c_0 n \leq m \leq n$ for some
constant $0 < c_0 \leq 1$.

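Let $\{\psi_0 \equiv 1, \psi_1, \ldots, \psi_d\}$ be a sequence of twice differentiable orthonormal functions
$[0,1] \to \mathbb{R}$, where $d \geq 1$ is the truncation parameter. Moreover, for $\ell = 0, 1, 2$, define

$$B_{\ell d} = \max_{1 \leq k \leq d} \|\psi_k^{(\ell)}\|_\infty = \max_{1 \leq k \leq d}\,\max_{z \in [0,1]} |\psi_k^{(\ell)}(z)|. \tag{4.1}$$

These quantities will play a key role in our analysis. For the particular choice of the
function basis as in (2.6) and (2.7), we specify below the order of $B_{\ell d}$, as a function of d,
for $\ell = 0, 1, 2$.

(i) (Legendre polynomial series). For the normalized Legendre polynomials $\psi_k$, it is
known that $B_{0d} = \max_{1 \leq k \leq d}\max_{z \in [0,1]} |\psi_k(z)| = \sqrt{2d+1}$. See, e.g. Sansone (1959).
Moreover, by the Markov inequality (Shadrin, 1992) applied on the interval [0,1],
$\|\psi_k'\|_\infty \leq 2k^2\|\psi_k\|_\infty$ and $\|\psi_k''\|_\infty \leq \frac{4}{3}k^4\|\psi_k\|_\infty$. Together with (4.1) and
$\sqrt{2d+1} \leq \sqrt{3d}$, this implies

$$B_{0d} = \sqrt{2d+1}, \quad B_{1d} \leq 2\sqrt{3}\,d^{5/2} \quad \text{and} \quad B_{2d} \leq 4 \cdot 3^{-1/2}\,d^{9/2}. \tag{4.2}$$

(ii) (Trigonometric series). For the trigonometric series $\psi_k(z) = \sqrt{2}\cos(\pi k z)$, it is
straightforward to see that $\psi_k'(z) = -\sqrt{2}\,\pi k\sin(\pi k z)$ and $\psi_k''(z) = -\sqrt{2}\,\pi^2 k^2\cos(\pi k z)$.
Consequently, we have

$$B_{0d} = \sqrt{2}, \quad B_{1d} \leq \sqrt{2}\,\pi d \quad \text{and} \quad B_{2d} \leq \sqrt{2}\,\pi^2 d^2. \tag{4.3}$$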

4.1 Asymptotic properties of $\Phi^S_\alpha(d)$

Assumption 4.2. The truncation parameter d is such that $d \leq m$ and as $n \to \infty$,
{lognY^^Bod = o(n^/®), (logn)^/^Bid = o(n^/^), (logn)^/^S2d = o(n^/^). 
The next theorem establishes the validity of the univariate smooth test in (2.5).

Theorem 4.1. Suppose that Assumptions 4.1 and 4.2 hold. Then as $n, m \to \infty$,

$$\sup_{0 < \alpha < 1} \big|P_{H_0}\{\Phi^S_\alpha(d) = 1\} - \alpha\big| \to 0. \tag{4.4}$$

Remark 4.1. In view of (4.2) and (4.3), it follows from Theorem 4.1 that the error in size 
of the smooth test ‘^^(d) using the trigonometric series (2.7) (resp. Legendre polynomials 
series (2.6)) tends to zero provided that d = o{(n/logn)^/^} (resp. d = o{(n/logn)^/®}) 
as n —>• oo. 

Next we consider the asymptotic power of 4’®(d) against local alternatives when d = 
dn,m —>■ oo as n, m —>■ oo. For the following results, let h = 2(n“^ + denote the 

harmonic mean of the two sample sizes. Our oracle statistic Ts given in (2.1) mimics 
\/nj2 maxi<fc<rf where 9 = {9i,..., 0^)’’’ G Consider testing Rq : P = 0 in 

(1.4) against the following local alternatives 

Hf:p = p 0 , for 0 G 0 := (ft = (6i,... ,6d)T G : max |6 a;| = (4.5) 

[ l<k<d \ n } 

where pe is as in (1.5) and A > 0 is a separation parameter. It is clear that the difficulty of 

testing between Hq and Hf depends on the value of A; that is, the smaller A is, the harder it 

is to distinguish between the two hypotheses. The power of the test <hQ(d) in (2.5) against 

Hf is provided by the following theorem. 

Theorem 4.2. Suppose that Assumption 4-d holds. The truneation parameter d = is 
sueh that d = o(n^/^) if the trigonometric series (2.7) is used to construct the test statistic 
'l'(d) in (2.3) and d = o(n^/®) if the Legendre polynomials series (2.6) is used. Then, under 
Hf with A > 2 + e for some e > 0, 

Urn P^J^l{d) = l} = l. (4.6) 

n,d^oo 1 

4.2 Asymptotic properties of $\Phi^{MB}_{\alpha}(d)$

In this section, we consider the multivariate case where the dimension $p = p_{n,m}$ is allowed to
grow with the sample sizes, and hence our results hold naturally for the fixed-dimension scenario.
Specifically, we impose the following assumption on the quadruplet (n, m, p, d).

Assumption 4.3. There exist constants Co, Ci > 0 and ci G (0,1) such that 

d < min{n, m,exp(Cop)}, max {p^pBl^)<Cin^-^\ (4.7) 

The next theorem establishes the validity of the multivariate smooth test $\Phi^{MB}_{\alpha}(d)$.

Theorem 4.3. Suppose that Assumptions 4.1 and 4.3 hold. Then as $n, m \to \infty$,

$$\sup_{0 < \alpha < 1} |P_{H_0}\{\Phi^{MB}_{\alpha}(d) = 1\} - \alpha| \to 0. \qquad (4.8)$$

5 Numerical studies


In this section, we illustrate the finite sample performance of the proposed smooth tests 
described in Sections 2 and 3 via Monte Carlo simulations. The univariate and the multi¬ 
variate cases will be studied separately. 

5.1 Univariate case 

Proposition 7.1 in Section 7 shows that the distribution of the test statistic $\hat{\Psi}(d)$ in (2.3)
can be consistently estimated by that of the absolute Gaussian maximum $|G|_\infty$, where
$G =_d N(0, I_d)$. To see how close this approximation is, we compare in Figure 1 the
cumulative distribution function of $|G|_\infty$ and the empirical distributions of $\hat{\Psi}(d)$ using the
trigonometric series (2.7) and the Legendre polynomial (LP) series (2.6), when the data are
generated from Student's t(7)-distribution, with n = 180, m = 150 and d = 12. We only
present the upper half of the curve since the $(1 - \alpha)$-quantile of $|G|_\infty$ with $\alpha \in (0, 1/2)$ is
of particular interest. It can be seen from Figure 1 that the cumulative distribution curves
of $|G|_\infty$ and the trigonometric series based statistic, denoted by T-$\hat{\Psi}(d)$, almost coincide,
while there is a slightly noticeable difference between $|G|_\infty$ and the LP series
based statistic LP-$\hat{\Psi}(d)$. Indeed, this phenomenon is to be expected from the theoretical
results. See, for example, the rate of convergence in (7.1) and (7.2), where the dependence
of $\{B_{\ell d}\}_{\ell = 0,1,2}$ on d can be found in (4.2) and (4.3).
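To make the comparison with the Gaussian maximum concrete, here is a minimal sketch of how a trigonometric-series statistic and its Gaussian critical value might be computed. The exact definition (2.3) is not reproduced in this section, so the form used below, $\sqrt{nm/(n+m)}\,\max_{k \le d} |m^{-1}\sum_j \psi_k(F_n(Y_j))|$ with $F_n$ the empirical distribution function of the X-sample, is an assumption inferred from the proofs in Section 7. The function names and the toy samples are hypothetical.

```python
import math
from bisect import bisect_right
from statistics import NormalDist

def smooth_statistic(x, y, d):
    """Assumed form: sqrt(nm/(n+m)) * max_k |mean_j psi_k(F_n(Y_j))|,
    with psi_k(z) = sqrt(2) cos(pi k z)."""
    n, m = len(x), len(y)
    xs = sorted(x)
    # Empirical distribution function of the X-sample evaluated at each Y_j.
    v_hat = [bisect_right(xs, yj) / n for yj in y]
    scale = math.sqrt(n * m / (n + m))
    best = 0.0
    for k in range(1, d + 1):
        mean_k = sum(math.sqrt(2) * math.cos(math.pi * k * v) for v in v_hat) / m
        best = max(best, abs(mean_k))
    return scale * best

def gaussian_max_quantile(alpha, d):
    """(1 - alpha)-quantile of max_k |G_k| for d independent N(0, 1) variables."""
    # P(max_k |G_k| <= t) = (2 Phi(t) - 1)^d, solved for t.
    return NormalDist().inv_cdf((1.0 + (1.0 - alpha) ** (1.0 / d)) / 2.0)

# Deterministic toy samples (not data from the paper).
x = [(i + 0.5) / 200 for i in range(200)]
y_same = [(j + 0.5) / 150 for j in range(150)]
y_shifted = [0.9 + 0.1 * (j + 0.5) / 150 for j in range(150)]

crit = gaussian_max_quantile(0.05, 4)
stat_same = smooth_statistic(x, y_same, 4)
stat_shifted = smooth_statistic(x, y_shifted, 4)
```

With these toy inputs the statistic stays well below the critical value when both samples cover [0, 1] evenly, and exceeds it when the second sample is concentrated in [0.9, 1].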

Next, we carry out 5000 simulations with nominal significance level $\alpha = 0.05$ to calculate
the empirical sizes of the proposed smooth test $\Phi^{S}_{\alpha}(d)$. We denote by T-$\Phi^{S}_{\alpha}(d)$ and
LP-$\Phi^{S}_{\alpha}(d)$, respectively, the tests based on the trigonometric series (2.7) and the LP
series (2.6). The sample sizes (n, m) are taken to be (80,60), (120,90), (180,150), and d takes
values 4, 8, 12. We compare the proposed smooth test with the testing procedure proposed
by Bera, Ghosh and Xiao (2013), denoted by BGX, the two-sample Kolmogorov-Smirnov (KS)
test and the two-sample Cramér-von Mises (CVM) test in five examples in which the data are
generated from Gamma, Logistic, Gaussian, Pareto and Stable distributions. The results are
summarized in Table 1, from which we see that in all five examples considered, the empirical
sizes of T-$\Phi^{S}_{\alpha}(d)$ with $d \in \{4, 8, 12\}$ are close to 0.05. This highlights the robustness of the
testing procedure T-$\Phi^{S}_{\alpha}(d)$ with respect to the choice of the truncation parameter d. Further,
we note that the empirical sizes of LP-$\Phi^{S}_{\alpha}(4)$ are comparable to those of the BGX test, while as d
increases, the test LP-$\Phi^{S}_{\alpha}(d)$ gradually suffers from size distortion. In fact, as pointed out by
Neyman (1937) and Bera and Ghosh (2002), when the Legendre polynomial series is used
to construct the test statistic, the effectiveness of the corresponding test in each direction
could be diluted if d is too large. Nevertheless, the test based on the trigonometric series
remains efficient as d increases and can be very powerful, as we shall see later.

Figure 1: Comparison of the empirical cumulative distributions of LP-$\hat{\Psi}(12)$, T-$\hat{\Psi}(12)$ and
the limiting cumulative distribution with n = 180 and m = 150. The plot is based on 5000
simulations.



The power performance is evaluated through the following five examples. In each example,
the result reported is based on 1000 simulations where the sample sizes (n, m) are taken
to be (120,90) and (180,150). Because of the size distortion of LP-$\Phi^{S}_{\alpha}(d)$, we
only compare the power of the trigonometric series based smooth test T-$\Phi^{S}_{\alpha}(d)$ with that
of the KS, CVM and BGX tests. The plots of power functions against different families of
alternative distributions from Examples 1-5 are given in Figure 2.

Example 1.

$X \sim F = \mathrm{uniform}(-1, 1)$ versus

$Y \sim G = G_\mu$ with density $g_\mu(x) = \frac{1}{2} + 2x\,\frac{\mu - |x|}{\mu^2}\, I(|x| < \mu)$ $(0 \le \mu \le 1)$.

Example 2.

$X \sim F = \mathrm{uniform}(-1, 1)$ versus

$Y \sim G = G_\sigma$ with density $g_\sigma(x) = \frac{1}{2}\{1 + \sin(2\pi\sigma x)\}$ $(0.5 \le \sigma \le 5)$.

Example 3.

$X \sim F = \mathrm{lognormal}(0, 1)$ with density $f(x) = (2\pi)^{-1/2} x^{-1} \exp\{-(\log x)^2/2\}$ versus
$Y \sim G = G_a$ with density $g_a(x) = f(x)\{1 + a \sin(2\pi \log x)\}$ $(-1 \le a \le 1)$.
Table 1: Comparison of empirical sizes with nominal significance level $\alpha = 0.05$

                                  T-$\Phi^{S}_{\alpha}(d)$           LP-$\Phi^{S}_{\alpha}(d)$
Model              (n, m)      d = 4    d = 8    d = 12   d = 4    d = 8    d = 12   BGX      KS       CVM
Gamma(2,2)         (80,60)     0.0504   0.0500   0.0490   0.0584   0.0724   0.1078   0.0634   0.0494   0.0524
                   (120,90)    0.0504   0.0510   0.0484   0.0542   0.0654   0.0830   0.0590   0.0438   0.0434
                   (180,150)   0.0496   0.0486   0.0484   0.0500   0.0554   0.0706   0.0530   0.0440   0.0444
Logistic(0,1)      (80,60)     0.0498   0.0496   0.0482   0.0576   0.0748   0.1038   0.0618   0.0456   0.0494
                   (120,90)    0.0504   0.0500   0.0496   0.0528   0.0666   0.0860   0.0552   0.0504   0.0466
                   (180,150)   0.0502   0.0498   0.0500   0.0508   0.0570   0.0696   0.0574   0.0438   0.0424
N(0,1)             (80,60)     0.0504   0.0488   0.0470   0.0570   0.0764   0.1060   0.0648   0.0494   0.0504
                   (120,90)    0.0502   0.0494   0.0482   0.0548   0.0694   0.0850   0.0566   0.053    0.0504
                   (180,150)   0.0502   0.0514   0.0516   0.0504   0.0544   0.0616   0.0542   0.0446   0.0500
Pareto(0.5,1,1)    (80,60)     0.0502   0.0484   0.0468   0.0582   0.0766   0.1064   0.0640   0.0460   0.0494
                   (120,90)    0.0500   0.0494   0.0494   0.0540   0.0640   0.0824   0.0592   0.0468   0.0480
                   (180,150)   0.0498   0.0496   0.0500   0.0530   0.0586   0.0724   0.0542   0.0436   0.0456
Stable(1.5,0,1,1)  (80,60)     0.0480   0.0470   0.0456   0.0544   0.0758   0.1088   0.0606   0.0474   0.0488
                   (120,90)    0.0496   0.0498   0.0494   0.0578   0.0692   0.0790   0.0614   0.0492   0.0514
                   (180,150)   0.0510   0.0514   0.0506   0.0508   0.0570   0.0690   0.0608   0.0490   0.0484


Example 4.

$X \sim F = \mathrm{uniform}(0, 1)$ versus

$Y \sim G = G_c$ with density $g_c(x) = \exp\{c \sin(5\pi x)\}$ $(0 \le c \le 2)$.

Example 5.

$X \sim F = \mathrm{uniform}(0, 1)$ versus

$Y \sim G = G_c$ with density $g_c(x) = 1 + c \cos(5\pi x)$ $(0 \le c \le 2)$.

The first two examples are, respectively, Examples 5 and 6 in Fan (1996), which were
designed to demonstrate the performance of the adaptive Neyman's test proposed there.
In Example 1, when $\mu = 0$, $G_\mu$ coincides with F. For this family of alternatives indexed by
$\mu$, the strength of the local feature depends on $\mu$ in the sense that the larger $\mu$ is, the
stronger the local feature. As expected, the powers of all the tests considered grow with $\mu$,
and when the sample sizes are large enough, the smooth tests T-$\Phi^{S}_{\alpha}(d)$ uniformly outperform
the others. Example 2, on the other hand, is designed to test global features with
various frequencies. It can be seen from the second row in Figure 2 that the test T-$\Phi^{S}_{\alpha}(16)$
has the highest power, which approaches 1 rapidly as $\sigma$ decreases to 0. The third example
is from Heyde (1963), where $g_a$ is a density and has the same moments as $f$ of any order.
In this example, the KS, CVM and BGX tests all suffer from very poor power, while,
surprisingly, the smooth tests based on the trigonometric series remain powerful. The last
two examples aim to cover the high-frequency alternations. Again, the proposed tests have
the highest powers. In fact, the BGX test was originally constructed to identify deviations
in the directions of mean, variance, skewness and kurtosis, and hence it can be relatively
less powerful in detecting local features or high-frequency components.
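For readers who wish to reproduce power experiments of this kind, alternatives such as the one in Example 2 can be simulated by rejection sampling. The sketch below is an illustration only: it assumes the Example 2 density has the form $g_\sigma(x) = \frac{1}{2}\{1 + \sin(2\pi\sigma x)\}$ on (-1, 1), and the function name, seed and sample size are hypothetical choices rather than the authors' simulation code.

```python
import math
import random

def sample_example2(size, sigma, rng):
    """Rejection sampler for g_sigma(x) = 0.5 * (1 + sin(2 pi sigma x)) on (-1, 1)."""
    draws = []
    while len(draws) < size:
        x = rng.uniform(-1.0, 1.0)  # proposal: uniform(-1, 1), density 1/2
        # Acceptance probability g(x) / (M * q(x)) with q = 1/2 and M = 2,
        # which simplifies to g(x) itself because max g = 1.
        accept_prob = 0.5 * (1.0 + math.sin(2.0 * math.pi * sigma * x))
        if rng.random() <= accept_prob:
            draws.append(x)
    return draws

rng = random.Random(0)  # fixed seed for reproducibility
y = sample_example2(500, sigma=1.0, rng=rng)
```

On average half of the proposals are accepted, so the sampler is cheap even for the larger sample sizes used in the simulations.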

5.2 Multivariate case 

The computation of the proposed multivariate smooth test and its critical value requires
finding optimal directions on the unit sphere $\mathcal{S}^{p-1}$ that maximize the non-smooth
objective functions (3.3) and (3.6), respectively. To solve these optimization problems, we
convert the data into spherical coordinates and employ the Nelder-Mead algorithm. As a
trade-off between the power and the computational feasibility of the test, we keep the value
of d fixed at 4.
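The spherical-coordinate step can be sketched as follows: a direction on the unit sphere in $\mathbb{R}^p$ is parametrized by p - 1 unconstrained angles, so that a derivative-free optimizer such as Nelder-Mead can search over the angles without an explicit norm constraint. The helper names below are hypothetical, and the objective functions (3.3) and (3.6) themselves are not reproduced here.

```python
import math

def angles_to_unit_vector(angles):
    """Map p-1 angles to a point on the unit sphere S^{p-1} in R^p."""
    u = []
    sin_prod = 1.0
    for theta in angles:
        u.append(sin_prod * math.cos(theta))
        sin_prod *= math.sin(theta)
    u.append(sin_prod)
    return u

def project(sample, u):
    """One-dimensional projections u^T x for each observation x in the sample."""
    return [sum(ui * xi for ui, xi in zip(u, x)) for x in sample]

u = angles_to_unit_vector([0.3, 1.1])  # a direction in R^3
norm_sq = sum(c * c for c in u)
proj = project([(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)], u)
```

An optimizer would then evaluate the projected one-dimensional statistic at `project(sample, angles_to_unit_vector(angles))` for each candidate vector of angles.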

Similar to the univariate case, we first carry out 5000 simulations with nominal significance
level $\alpha = 0.05$ to calculate the empirical sizes of the proposed test T-$\Phi^{MB}_{\alpha}(d)$ with
trigonometric series. For each $p \in \{3, 5, 10\}$, the data are generated from multivariate normal
and t-distributions with different degrees of freedom (4 and 8) and covariance structures
($I_p$ and $\Sigma$). Sample sizes (n, m) are taken to be (180,160). We summarize the results in
Table 2, comparing with the method proposed by Baringhaus and Franz (2004), which will
be referred to as the BF test. From Table 2 we see that when p = 3, 5, both methods have an
empirical size fairly close to 0.05; when p = 10, the empirical size of the proposed smooth
test increases since the optimization over the unit sphere becomes more challenging, while
the empirical size of the BF test is typically smaller than the nominal level.

The power performance of the multivariate smooth test is evaluated through Examples
6-9. The first two are multivariate versions of Examples 1 and 4, which demonstrate,
respectively, the alternations with local feature and high frequency. The last two examples
are designed to examine a rotation effect in the alternations. In each one, the power reported
is based on 1000 simulations where the sample sizes (n, m) are taken to be (180,160). Again,
we compare the power of the trigonometric series based smooth test T-$\Phi^{MB}_{\alpha}(d)$ with that of
the BF test. The power curves are depicted in Figure 3.
Figure 2: Empirical powers for Examples 1-5 based on 1000 replications with $\alpha = 0.05$.
[Plot not reproduced: one row of panels per example, for (n, m) = (120, 90) and (180, 150); vertical axis: power.]
Table 2: Empirical size with significance level $\alpha = 0.05$.

                     Multivariate Normal          Multivariate t
                     N(0, I_p)   N(0, Σ)    t_4(0, I_p)   t_8(0, I_p)   t_4(0, Σ)   t_8(0, Σ)
p = 3    Proposed    0.0446      0.0456     0.0514        0.0442        0.0494      0.0458
         BF          0.0480      0.0494     0.0504        0.0488        0.0448      0.0484
p = 5    Proposed    0.0496      0.0472     0.0494        0.0560        0.0450      0.0514
         BF          0.0466      0.0472     0.0502        0.0458        0.0488      0.0484
p = 10   Proposed    0.0582      0.0594     0.0512        0.0516        0.0570      0.0602
         BF          0.0422      0.0454     0.0364        0.0422        0.0482      0.0438

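Example 6.

$X = (X_1, X_2, X_3)^T$, $X_1, X_2 \sim \mathrm{uniform}(-1, 1)$, $X_3 = 0.3X_1 + 0.7X_2$ versus
$Y = (Y_1, Y_2, Y_3)^T$, $Y_1, Y_2 \sim g_\mu(x) = \frac{1}{2} + 2x\,\frac{\mu - |x|}{\mu^2}\, I(|x| < \mu)$ $(0 \le \mu \le 1)$, $Y_3 = 0.3Y_1 + 0.7Y_2$.


Example 7.

$X = (X_1, X_2, X_3)^T$, $X_1, X_2 \sim \mathrm{uniform}(0, 1)$, $X_3 = 0.3X_1 + 0.7X_2$ versus
$Y = (Y_1, Y_2, Y_3)^T$, $Y_1, Y_2 \sim g_c(x) = \exp\{c \sin(5\pi x)\}$ $(0 \le c \le 2)$, $Y_3 = 0.3Y_1 + 0.7Y_2$.


Example 8.

$X \sim N(0, I_5)$ versus $Y = AZ$, $Z \sim N(0, I_5)$,

where $A = \begin{pmatrix} I_3 & 0 \\ 0 & A_0 \end{pmatrix}$, $A_0 = \begin{pmatrix} \sqrt{1-\delta} & \sqrt{\delta} \\ \sqrt{\delta} & \sqrt{1-\delta} \end{pmatrix}$ $(0 \le \delta \le 0.5)$.


Example 9.

$X \sim t_4(0, I_5)$ versus $Y = AZ$, $Z \sim t_4(0, I_5)$,

where $A = \begin{pmatrix} I_3 & 0 \\ 0 & A_0 \end{pmatrix}$, $A_0 = \begin{pmatrix} \sqrt{1-\delta} & \sqrt{\delta} \\ \sqrt{\delta} & \sqrt{1-\delta} \end{pmatrix}$ $(0 \le \delta \le 0.5)$.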


Figure 3 shows that the proposed smooth test uniformly outperforms the BF test in all
the examples in terms of power. Since we are using trigonometric series, the test is especially
powerful if the data contain high-frequency components (Example 7), which are difficult
for the BF test to detect.
Figure 3: Empirical powers for Examples 6-9 based on 1000 replications with $\alpha = 0.05$.
[Plot not reproduced: one panel per example, each with (n, m) = (180, 160); vertical axis: power.]



6 Discussion 

We introduced in this paper a smooth test for the equality of two unknown distributions,
which is shown to maintain the pre-specified significance level asymptotically. Moreover,
it was shown theoretically and numerically that the test is especially powerful in detecting
local features or high-frequency components.

The proposed procedure depends on a user-specified parameter d, which is the number
of orthogonal directions used to construct the test statistic. Theoretically, the size of d
is allowed to grow with n and can be as large as $o(n^{c})$ for some 0 < c < 1. Since the
optimal value of d depends on how far the two unknown distributions deviate from each
other, it is not possible in practice to define an optimal choice of d. As suggested by
our numerical studies, d = 10 is a reasonable choice when the sample sizes are of the
order of $10^2$, which leads to a good compromise between the computational cost and the
performance of the test. Alternatively, a data-driven approach based on a modification
of Schwarz's rule was proposed by Inglot, Kallenberg and Ledwina (1997), that is,
$\hat{d} = \arg\max_{1 \le d \le D(n,m)} \{\hat{\Psi}(d) - d \log(n + m)\}$ for some $D(n,m) \to \infty$ as $\min(n,m) \to \infty$, where
$\hat{\Psi}(d)$ is the test statistic using the first d orthonormal functions. This principle can be
applied to the proposed testing procedure by setting D(n,m) to be some large value, say
20. Nevertheless, the optimal choice of D(n,m) remains unclear.
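The penalized selection rule described above can be sketched in a few lines. The function below simply maximizes the statistic minus the penalty d log(n + m) over a supplied set of candidate values of d. The function name and the numerical values of the statistic are hypothetical and serve only to illustrate the mechanics; they are not results from the paper.

```python
import math

def select_truncation(stat_by_d, n, m):
    """Return the d maximizing stat_by_d[d] - d * log(n + m)."""
    penalty = math.log(n + m)
    return max(sorted(stat_by_d), key=lambda d: stat_by_d[d] - d * penalty)

# Hypothetical statistic values for d = 1, ..., 5 (illustrative numbers only).
stats = {1: 4.0, 2: 11.5, 3: 30.2, 4: 31.0, 5: 31.4}
d_hat = select_truncation(stats, n=180, m=150)
```

In this toy input the statistic jumps at d = 3 and then flattens, so the penalty makes the rule settle on d = 3 rather than on the largest candidate.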
The computation of the multivariate test statistic $\hat{\Psi}_{\max}(d)$ requires solving an
optimization problem with an $\ell_2$-norm constraint. To solve this problem when the dimension
p is relatively small, we first convert the data into spherical coordinates and then use
the Nelder-Mead algorithm. An interesting extension is to combine our method with the
smoothing technique as in Horowitz (1992). Let $K : \mathbb{R} \mapsto \mathbb{R}$ be a symmetric, bounded
density function. For a predetermined small number $h = h_n > 0$, $\hat{\psi}_{u,k}$ is approximated by
a continuous function 4’u,k,h = where 



As $h \to 0$, $\hat{\psi}_{u,k,h}$ converges to $\hat{\psi}_{u,k}$ almost surely, and hence for each $k \in [d]$,
$\sup_{u \in \mathcal{S}^{p-1}} |\hat{\psi}_{u,k,h}|$ is a smoothed version of $\sup_{u \in \mathcal{S}^{p-1}} |\hat{\psi}_{u,k}|$. The smoothing
technique can be similarly applied to the multiplier bootstrap statistic. Consequently, we can
employ the gradient descent algorithm to solve the optimization for smooth functions. We leave
a thorough comparison of various algorithms for different values of p as an interesting problem
for future research.

7 Proof of the main results 

In this section we prove Theorems 4.1-4.3. Proofs of the lemmas and some additional
technical arguments are given in the Appendix. Throughout this section, we write N =
n + m and use C and c to denote absolute positive constants, which may take different
values at each occurrence. We write $a \lesssim b$ if a is smaller than or equal to b up to an
absolute positive constant, and $a \gtrsim b$ if $b \lesssim a$.


7.1 Proof of Theorem 4.1 


Recall that $G = (G_1, \ldots, G_d)^T$ is a d-dimensional standard Gaussian random vector. The
distribution of $|G|_\infty$ is absolutely continuous, so that $P\{|G|_\infty \ge c_\alpha(d)\} = \alpha$. Therefore, under
the assumption that $d \le n \wedge m$, the conclusion (4.4) follows immediately from the following
proposition.

Proposition 7.1. Assume that the conditions of Theorem 4.1 hold and let



(log d)T2 




(7.1) 


Then under Hq : F = G 



t>o 


(7.2) 


The proof of Proposition 7.1 is provided in Section 7.4.1. 


□ 
7.2 Proof of Theorem 4.2


For the d-dimensional Gaussian random vector G, applying the Borell-TIS (Borell-Tsirelson-
Ibragimov-Sudakov) inequality (van der Vaart and Wellner, 1996) yields that for every $t \ge 0$,
$P\{|G|_\infty \ge E(|G|_\infty) + t\} \le \exp(-t^2/2)$. By taking $t = \sqrt{2\log(1/\alpha)}$, we get

$$c_\alpha(d) \le E(|G|_\infty) + \sqrt{2\log(1/\alpha)}, \qquad (7.3)$$

where $c_\alpha(d)$ denotes the $(1 - \alpha)$-quantile of $|G|_\infty$. A standard result on Gaussian maxima
yields $E(|G|_\infty) \le \{1 + (2\log d)^{-1}\}\sqrt{2\log d}$.
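The quantile bound obtained by combining (7.3) with the bound on the expected Gaussian maximum can be checked numerically. For independent standard normal coordinates the quantile of $|G|_\infty$ is available in closed form, since $P(|G|_\infty \le t) = \{2\Phi(t) - 1\}^d$. The sketch below compares it with the upper bound for a few values of d; the function names and the choice of d values are illustrative assumptions.

```python
import math
from statistics import NormalDist

def exact_quantile(alpha, d):
    """Exact (1 - alpha)-quantile of |G|_inf for G ~ N(0, I_d)."""
    return NormalDist().inv_cdf((1.0 + (1.0 - alpha) ** (1.0 / d)) / 2.0)

def upper_bound(alpha, d):
    """{1 + (2 log d)^{-1}} sqrt(2 log d) + sqrt(2 log(1/alpha)), for d >= 2."""
    two_log_d = 2.0 * math.log(d)
    mean_bound = (1.0 + 1.0 / two_log_d) * math.sqrt(two_log_d)
    return mean_bound + math.sqrt(2.0 * math.log(1.0 / alpha))

pairs = [(exact_quantile(0.05, d), upper_bound(0.05, d)) for d in (4, 8, 12, 100)]
```

The bound is conservative at these values of d, which is harmless for the power argument since only the order $\sqrt{\log d}$ matters there.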

Let k* = argmaxfcgjrf] \9k\ under Hf and assume without loss of generality that 6k* > 0. 
By (7.7) and (7.10) in the proof of Proposition 7.1, we have 

^ I / TtTTl/ '' 1 

> <^a{d)} > ^k* > Ca(d) > 

— I ^ ^ -^2^*) > Ca((i)|, (7.4) 

where A: = y/n/mlipkiVj) -'dk}I{j G [rn]} + ^/mJnhlk{Xj-rn)I{j G m + [n]} for {j,k) £ 
[AT] X [d] with F,- = F(y,), ^k = EHfiMV)} and h,k{x) = > F{x)}-V]). 

Note that E{hik{X)} = 0 and thus E{^jk) = 0. Let £{ti,t 2 ) be as in (7.12) for ti,t 2 > 0 
to be specified. Put <5 = tiB 2 d + + *JmnjN{9k* — '&k*) + \/2 log(l/a), then it follows 

from (7.3) and (7.4) that 

EHf{^ > CQ,(d)} 

2 E + 2iid) 7^+- \/w 

2 - l)V^d + s{ - 

In particular, taking ti = tin{d) x n~^^‘^*s/logd and t 2 = t 2 n{d) ^ n“^/^logd implies by 
(7.16) that P{£{ti,t 2 T} —)• 0 as d —)■ oo. Further, by (8.8) and the conditions of the 
theorem, we have 6 = o(-v/log d). Gonseqnently, as d —)■ oo, 

> c„(d)} > <-|\/logdJ - P{£{ti,t2y} 

This completes the proof of Theorem 4.2. □ 
7.3 Proof of Theorem 4.3


We first introduce two propositions describing the limiting null properties of the multivariate
smooth and multiplier bootstrap statistics used to construct the test. The conclusion of
Theorem 4.3 then follows immediately.

The first proposition characterizes the non-asymptotic behavior of the multivariate
smooth statistic, which involves the supremum of a centered Gaussian process.

Let $\mathcal{F} = \mathcal{F}^{p}_{d}$ be as in (3.4); for simplicity, the dependence of $\mathcal{F}$ on (p, d) is
suppressed in the notation.

Proposition 7.2. Suppose that Assumptions 4.1 and 4.3 hold. Then there exists a centered,
tight Gaussian process G indexed by $\mathcal{F}$ such that under the null hypothesis $H_0 : F = G$,

$$\sup_{t \ge 0} |P\{\hat{\Psi}_{\max}(d) \le t\} - P(\|G\|_{\mathcal{F}} \le t)| \le C n^{-c}, \qquad (7.5)$$

where C and c are positive constants depending only on $c_0$, $c_1$, $C_0$ and $C_1$.

Proposition 7.2 implies that the "limiting" distribution of $\hat{\Psi}_{\max}$ depends on the unknown
covariance structure given in (3.5). To compute a critical value we suggest using the multiplier
bootstrap as described in Section 3.2. The following result, which can be regarded as a
multiplier central limit theorem, provides the theoretical justification of its validity. In fact,
the construction of the multiplier bootstrap statistic $\hat{\Psi}^{MB}_{\max}(d)$ involves the use of artificial
random numbers to simulate a process, the supremum of which is (asymptotically) equally
distributed as $\|G\|_{\mathcal{F}}$ according to Proposition 7.3 below.

Proposition 7.3. Suppose that Assumptions 4.1 and 4.3 hold. Then with probability at
least $1 - 3n^{-1}$,

$$\sup_{t \ge 0} |P_e\{\hat{\Psi}^{MB}_{\max}(d) \le t\} - P(\|G\|_{\mathcal{F}} \le t)| \le C n^{-c} \qquad (7.6)$$

for G as defined in Proposition 7.2, where C and c are positive constants depending only
on $c_0$, $c_1$, $C_0$ and $C_1$.

Proofs of the above two propositions are given in Section 7.4. □


7.4 Proof of Propositions 7.1—7.3 


7.4.1 Proof of Proposition 7.1 

For every k € [d], it follows from (2.2) and Taylor expansion that 


1 

m 


TTl 7Tl 7TL 772 

- v,) + ^ - v,? 

j=i j=i j=i j=i 

- 772 ^ 72 772 

= „ E'*‘('"j) + EE S Yj) - F{Y,)]+Ru, (7,7) 


i=i 


i=l j=l 
where $R_{1k} := (2m)^{-1} \sum_{j=1}^{m} \psi_k''(\zeta_j)(\hat{V}_j - V_j)^2$ and $\zeta_j$ is a random variable lying between $\hat{V}_j$
and $V_j$. It is straightforward to see that $|R_{1k}| \le \frac{1}{2}\|\psi_k''\|_\infty \max_{1 \le j \le m} (\hat{V}_j - V_j)^2$. A direct
consequence of the Dvoretzky-Kiefer-Wolfowitz inequality (Massart, 1990), i.e. for every
$t \ge 0$, $P\{\sqrt{n} \sup_x |F_n(x) - F(x)| \ge t\} \le 2\exp(-2t^2)$, is that

$$P\Big(n \max_{1 \le k \le d} |R_{1k}| / \|\psi_k''\|_\infty \ge t\Big) \le 2\exp(-4t). \qquad (7.8)$$
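The step from the Dvoretzky-Kiefer-Wolfowitz inequality to (7.8) is a change of variable: because the Taylor remainder is bounded by half the squared sup-distance between the empirical and true distribution functions, the event in (7.8) forces $\sqrt{n}\sup_x |F_n(x) - F(x)| \ge \sqrt{2t}$. A minimal sketch of this arithmetic, with hypothetical function names, is:

```python
import math

def dkw_tail(t):
    """DKW bound on P{sqrt(n) sup_x |F_n(x) - F(x)| >= t}."""
    return min(1.0, 2.0 * math.exp(-2.0 * t * t))

def remainder_tail(t):
    """Implied bound on P{n max_k |R_1k| / ||psi_k''||_inf >= t}."""
    # |R_1k| <= 0.5 * ||psi_k''||_inf * sup|F_n - F|^2, so the event requires
    # sqrt(n) * sup|F_n - F| >= sqrt(2 t); plug this threshold into DKW.
    return dkw_tail(math.sqrt(2.0 * t))

value = remainder_tail(1.5)
```

Substituting $\sqrt{2t}$ into $2\exp(-2t^2)$ gives $2\exp(-4t)$, which is the rate appearing in (7.8).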

Let $h_k(x, y) = \psi_k'(F(y))\{I(x \le y) - F(y)\}$ for $x, y \in \mathbb{R}$ be a kernel function
$\mathbb{R} \times \mathbb{R} \mapsto \mathbb{R}$. Then the second addend on the right side of (7.7) can be written as
$U_{n,m}(k) = (nm)^{-1} \sum_{i=1}^{n} \sum_{j=1}^{m} h_k(X_i, Y_j)$ with $E\{h_k(X, Y)\} = 0$. Observe that $U_{n,m}(k)$ is a
two-sample U-statistic with a bounded kernel $h_k$ satisfying $\bar{h}_k := \|h_k\|_\infty \le \|\psi_k'\|_\infty$ and

$$\sigma_k^2 := E\{h_k(X, Y)^2\} = E\{(V - V^2)\psi_k'(V)^2\} \le \|\psi_k'\|_\infty^2 / 4. \qquad (7.9)$$

Let $h_{1k}(x) = E_{H_0}\{h_k(X, Y) | X = x\}$ and $h_{2k}(y) = E_{H_0}\{h_k(X, Y) | Y = y\}$ be the first order
projections of the kernel $h_k$ under $H_0$. Since X and Y are independent and
$V = F(Y) =_d \mathrm{Unif}(0, 1)$ under $H_0$, we have $h_{2k} \equiv 0$ and

$$h_{1k}(x) = E\big(\psi_k'(V)[I\{V \ge F(x)\} - V]\big) = \int_{F(x)}^{1} \psi_k'(v)\,dv - \int_{0}^{1} v\,\psi_k'(v)\,dv = -\psi_k(F(x)).$$

Define random variables $U_i = F(X_i) =_d \mathrm{Unif}(0, 1)$ that are independent of $V_j = F(Y_j)$.
Then, using Hoeffding's decomposition gives

$$U_{n,m}(k) = -\frac{1}{n}\sum_{i=1}^{n} \psi_k(U_i) + \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m} h_{0k}(X_i, Y_j) := -\frac{1}{n}\sum_{i=1}^{n} \psi_k(U_i) + R_{2k}, \qquad (7.10)$$

where $h_{0k}(x, y) = h_k(x, y) - h_{1k}(x) - h_{2k}(y)$.
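The closed form of the first-order projection, namely that the expectation of $\psi_k'(V)[I\{V \ge u\} - V]$ over $V \sim \mathrm{Unif}(0,1)$ equals $-\psi_k(u)$, rests on integration by parts and on the orthogonality of $\psi_k$ to the constant function. It can be verified numerically for the trigonometric basis. The sketch below uses a midpoint rule; the function names, the grid size and the test point (k, u) = (3, 0.3) are illustrative assumptions.

```python
import math

def psi(k, v):
    """Trigonometric basis function sqrt(2) cos(pi k v)."""
    return math.sqrt(2) * math.cos(math.pi * k * v)

def psi_prime(k, v):
    """Derivative of psi with respect to v."""
    return -math.sqrt(2) * math.pi * k * math.sin(math.pi * k * v)

def first_projection(k, u, grid_size=100000):
    """Midpoint-rule value of E{psi_k'(V) [I(V >= u) - V]} for V ~ Unif(0, 1)."""
    total = 0.0
    for i in range(grid_size):
        v = (i + 0.5) / grid_size
        indicator = 1.0 if v >= u else 0.0
        total += psi_prime(k, v) * (indicator - v)
    return total / grid_size

k, u = 3, 0.3
lhs = first_projection(k, u)   # numerical expectation
rhs = -psi(k, u)               # claimed closed form
```

The two values agree to within the discretization error of the midpoint rule, consistent with the displayed identity.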

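In view of (7.7) and (7.10), we introduce a new sequence of independent random vectors
$\{\xi_j = (\xi_{j1}, \ldots, \xi_{jd})^T\}_{j=1}^{N}$ for N = n + m, defined by

$$\xi_{jk} = \begin{cases} \sqrt{n/m}\,\psi_k(V_j), & 1 \le j \le m, \\ -\sqrt{m/n}\,\psi_k(U_{j-m}), & m + 1 \le j \le N. \end{cases} \qquad (7.11)$$

Put $\psi = (\psi_1, \ldots, \psi_d)^T$, $R_1 = (R_{11}, \ldots, R_{1d})^T$ and $R_2 = (R_{21}, \ldots, R_{2d})^T$, such that

$$\sqrt{\frac{nm}{N}}\,\frac{1}{m}\sum_{j=1}^{m} \psi(\hat{V}_j) = \frac{1}{\sqrt{N}}\sum_{j=1}^{N} \xi_j + \sqrt{\frac{nm}{N}}\,(R_1 + R_2).$$

Recall that $\{\psi_0 \equiv 1, \psi_1, \ldots, \psi_d\}$ is a set of orthonormal functions and $V =_d \mathrm{Unif}(0, 1)$
under $H_0$. By (7.11), the covariance matrix of $N^{-1/2}\sum_{j=1}^{N} \xi_j$ is equal to $I_d$.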
For any $t_1, t_2 > 0$, define the event

$$\mathcal{E}(t_1, t_2) = \bigcap_{k=1}^{d} \big\{\sqrt{m}\,|R_{1k}| \le \|\psi_k''\|_\infty t_1\big\} \cap \big\{\sqrt{m}\,|R_{2k}| \le \|\psi_k'\|_\infty t_2\big\}. \qquad (7.12)$$

Under Hq, we have for every t > 0, 




= Pi max 
l<fc<d 


= Pi max 
l<k<d 


n 

mN 




i=i 


< t 




< t 


<-P{ <t + ip^{tiB2d + t2Bid)^+P{£{ti,t2T}, (7.13) 


where B^d [i = 1,2) are as in (4.1). To get rid of the absolute value in (7.13), a similar 
argument as in the proof of Theorem 1 in Chang, Zhou and Zhou (2014) gives 

N X / ^ N 


P ( max 
l<k<d 




)s E f*' s i) = p( ^ E fF s ‘ 


( 7 . 14 ) 


where is a sequence of dilated random vectors taking values in defined 

by • • •) = (^})“^})^' III view of (7.14), we only need to focus on 

maxi<fc<rf J2f=i^jk without losing generality. 

Note that ^jk are bounded random variables satisfying E{^jk) = 0 and \^jk\ < y^llV’^Hoo 
Applying Lemma 2.3 and Lemma 2.1 in Chernozhukov, Chetverikov and Kato (2013), re¬ 
spectively, yields 


sup 

teiR 


P 


max 


N 


i<k<d y/N ^ 


Cjk <t] - P 


max Gk < t 
l<k<d 


< 

rsj 


{log(dn)}^/® 


n 


1/8 


Bd 


where Bd := [E{m.a.-yii<k<d\d’k{V)\^}]^/'^ < G = (Gi,...,Gd)T =d N{0,ld) and for 

every e > 0, 


supPI 

max Gk — t 

iSM V 

l<k<d 

The last two displays jointly imply 


4e(l + ^2 log d). 


<p(^ max^Gfc +c|M^^^Bo^/V(logd)V2(t,S2d + t2Birf)|. 


(7.15) 
For $P\{\mathcal{E}(t_1, t_2)^c\}$ in (7.13), it follows from (7.8) and (8.3) in Lemma 8.2 that

d 

P{£{ti,t 2 f] < 2exp(-4tin/Vm) + '^^P{y/m\R 2 k\ > \\Y'k\\oot2/2) 

k=l 

< exp(—4ti-v/ra) + (iexp(—ct 2 -\/n)- (7-16) 


Taking ti x {'y 2 nn) t 2 ^ {'Jinn) in (7.13) implies by (7.15) and (7.16) that 

PH,{^<t) <P(|G|oo <t)+C 


{log(dn)}^/® 3/4 /— ' 

Pod P vTln T 'JyYn 


n 


1/8 


where 7 ^ {i = 1,2) are as in (7.1). Here, the last inequality relies on the fact that 
supi>o(te“*) < e~^. A similar argument leads to the reverse inequality and thus completes 
the proof. □ 


7.4.2 Proof of Proposition 7.2 

In view of (4.7), we assume without loss of generality that Bid < Let T = Tj he 
the product space x [d]. For every u G 5^“^, let = Fn{u'^Yj), 17“ = F^{u'^Yj), 
j G [m] and = F'^{u^Xi), i G [n]. By Taylor expansion and arguments similar to those 
employed in the proof of Proposition 7.1, we obtain that for every {u, k) G T, 



where Un,m{u, k) := Y17=i V'fc(Lj“)7{u'''(Aj — F)) < 0 } is a two-sample [/-statistic with 

E{Un,ni{u,k)} = YkiY under Hq and < 5 HV^fc lU niax^gj^] (I7“ - I/“)2. Let 

n = n^^ = {K,k{-, ■):MP xRPe^ M |(u, k) G T} (7.18) 

be a class of measurable functions, where hu^k{x^y) = i/^(,(F“(u'''y))/{uT(a; — y) < 0 } for 
x,y £ M^. For ease of exposition, the dependence of T and 77 on (p, d) will be assumed 
without displaying. In the above notation, each h = hu^k £ 77 determines a two-sample 
[/-statistic Un,m{h) := = Un,m{u,k), such that {Un,m{h)}hen forms 

a two-sample [/-process indexed by the class 77 of kernels. Moreover, define the degenerate 
version of 77 as 

Po = {ho{x,y) = h{x,y) - {PYh){x) - {Pxh){y) + {Px x Py)(/i) \h G 77}, (7.19) 

where for h{-, •) : x 1 —>■ M, 

iPxh){-)= [ h{x,-)dFix), {PYh){-)=[ h{;y)dG{y) 
and $(P_X \times P_Y)(h) = \int\!\!\int h(x, y)\,dF(x)\,dG(y)$. Under $H_0$, it is easy to verify that for every
{u,k) E T, {Px X PY){hu,k) = 


{PYhu,k){x) = V’fc(l) - il^kiP'^iu^x)) and {PxK,k){y) = 'il)'k{F'^{u^y))F'^{u^y)- 


In addition to % and Ho, we define the following class of measurable functions on 


J=' = fu,k{x) ='il^k° fu{x) : /c E [d], /n E F^] 


(7.20) 


where 



(7.21) 


with PP = {(x,y) !-)■ I{F{x — y) < 0} : tt E 5*’ ^}. 
Together, (7.17) and (7.19)-(7.21) lead to 



(7.22) 


where ||C4,m||wo = \Un,m{ho)\ 



(7.23) 


with {Zi,.. .,Zx} = {hi,.. .. .,Xn} and wj = y^n/ml{j E [m]} - y/m/nl{j E 

[n] + m} for j = 1,..., N. 

With the above preparations, the rest of the proof involves three steps. First, the
approximation of the test statistic $\hat{\Psi}_{\max}$ by $\Psi_0$ requires the uniform negligibility of the right side of
(7.22). Second, we prove the Gaussian approximation of $\Psi_0$ by the supremum of a centered,
tight Gaussian process G indexed by $\mathcal{F}$ with covariance function



(7.24) 


for $(u, k), (v, \ell) \in \mathcal{T}$. Finally, we apply an anti-concentration argument due to
Chernozhukov, Chetverikov and Kato (2014b) to construct the Berry-Esseen type bound.


Step 1. The following two results show the uniform negligibility of the right side of (7.22).

Lemma 7.1. Assume that the conditions of Proposition 7.2 hold. Then under $H_0 : F = G$,


E{\\Un,m\\Ho) < B 2 dV{P + '^ogd)nm. (7.25) 

Lemma 7.2. With probability at least 1 — 2n“^, we have 


nm. 


(7.25) 



(7.26) 
By (7.25) and (7.26), it follows from the Markov inequality that for t > 0,


P{{nm) ^^‘^\\Un,m\\Ho > ^B 2 d\/p + logd (7.27) 


for any t > 0 and with probability at least 1 — 2 n \/nmsup („\ Ru,k\ ^ i? 2 d\/p + logn. 

Taking t = 'y~^B 2 d\/p + logd for some 7 G (0,1) in (7.27) implies by (7.22) that 


Pi 


> R VP + '^osd + logn , ^ 
' ‘2^d _ /77” 1 


n 


7 + re 


-1 


(7.28) 


Step 2. The following result establishes the Gaussian approximation for $\Psi_0$.


Lemma 7.3. Assume that the conditions of Proposition 7.2 hold. Then under $H_0$, there
exists a centered, tight Gaussian process G indexed by $\mathcal{F} = \mathcal{F}^{p}_{d}$ given in (7.20) with
covariance function (7.24) and a random variable $\Psi^* =_d \|G\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |G f|$ such that for every
$\gamma \in (0, 1)$,


P< 14'n - > B 


id- 


K log n ^ ^ 1/2 (Kd log ^ ^ 1/3 {pp log re)2/3 | 


re 


Id 


l/2„l/4 + ^Id ^l/3„l/6 I 


7 


< 7 + re ^ log re, 


(7.29) 


where = p + log d. 

By (7.28) and (7.29) with = p + logd, 

Pjl^'max -^*\> Ai 47)} < A2n(7), (7-30) 


where 


Ain (7) = B2d 


(Ts:^+ logre)^/2 TC^logre i/2(i^^logre)3/4 1/3 (it'^logre)2/3 

1 , _j_ ^ \ 


n 


+ Bid 


re 


Id 




+ B 


Id 


71 / 3 ^ 1/6 


and A 2 n( 7 ) = 7 + re ^ log re. 


Step 3. Now we restrict attention to the Gaussian supremum $\Psi^*$. By Corollary 2.2.8 in
van der Vaart and Wellner (1996) and (8.17), we get

E^* < [ /suplogA^(T', L 2 {Q),e) de < y/p + \ogd. 

Jo y Q 


Combined with Corollary 2.1 in Chernozhukov, Chetverikov and Kato (2014b), this implies
for every $\varepsilon > 0$ that

$$\sup_{t \ge 0} P(|\Psi^* - t| \le \varepsilon) \lesssim \varepsilon\sqrt{p + \log d}. \qquad (7.31)$$

Together, (7.30) and (7.31) yield, for every $t \ge 0$,

$$P(\hat{\Psi}_{\max} \le t) \le P\{\Psi^* \le t + C\Delta_{1n}(\gamma)\} + C\Delta_{2n}(\gamma)
\le P(\Psi^* \le t) + C'\{\Delta_{1n}(\gamma)\sqrt{p + \log d} + \Delta_{2n}(\gamma)\}.$$


A similar argument leads to the reverse inequality. Finally, in view of (4.7), taking 

^ f Dl/2(P + logn)l/'^ 1/4^1 

7 = 7n(p,d) =max<'52^ --,5^^ (logn) / 


1 / 8 ’ 

.5/6 


^iT(logn)2/3- 


1/3 


completes the proof under the assumption d < min{n, m, exp(C'op)}. 


□ 


7.4.3 Proof of Proposition 7.3 


Throughout the proof, is a sequence of i.i.d. standard normal random variables and 

Pe denotes the probability measure induced by {ei}2=i holding fixed. For every 

(u, k) G T = X [d], by Taylor expansion we have 


-71 -71 - 


2=1 


2=1 


EE hu,k{^ii Xj) Ru,k) 

i=l j=l 


(7.32) 


where = (ei,A7 )T e RP+\ 


hu,k{xi,X2) = ei i/ife o F“(uTxi) [l{u'^{x2 - xi) < 0} - F^{u'^xi )], xe = (e^, xjy E 


for f = 1, 2 and the remainder Ru,k is such that 

|-R«,fc| < :^.B 2 d\/nmax|ei| X sup {Uf-Uff. (7.33) 

2 is[n] (n,i)SiSP“l X [n] 

Because ei and Aj are independent, we have Fi{/iM^fc(Ai, A 2 )|Ai} = Fi{/iu^fc(Ai, A 2 )|A 2 } = 
0 so that { X]r=i Z]/=i ^j)}(u fc)g 7 - forms a degenerate [/-process. With slight 

abuse of notation, we rewrite the function hu^k as hu,k{xi-,X 2 ) = ei • Wu,kixi,X 2 ), where 


Wu,k{xi,X 2 ) = 'ip[{F'"{u^xi)) [I{u^{x 2 - Xi) < 0} - F'^{u^xi)]. 

In this notation, we have := {hu,k ■ iu,k) E T} C {e 1 —)• e} • with = {wu,k '■ 
{u,k) E T}. Arguments similar to those employed in the proof of Lemma 8.4 can be 
used to prove that the collection is VC-type, and so is with envelop H given by 
H{x) = H{e,x) = B 2 d|e|, such that 


supA(7i2,L2(Q),e||^||Q,2) <d-iA/erP (7.34) 

Q 
for some constants $A \ge 2e$ and $v \ge 2$. This uniform entropy bound, together with
Theorem 6 in Nolan and Pollard (1987), yields


E{ sup 

{u,k)£T 


< B2dn{ - + 


EE hu,k (-^i) 

i=l j=l 
/■1/4 


'0 


sup ^JlogN{'H^^,L 2 {Q),e\\H\\Q^ 2 ) dej < B 2 dn^/p + logd (7.35) 


by following the same lines as in the proof of Proposition 7.1. 
For Ru,k, applying the Borell-TIS inequality gives 


P< max |ei| < E 

[ *e[n] 


max \ ei 

ie[n] 



< exp(— 1 ^/ 2 ) 


for every t > 0 . A standard result on Gaussian maxima is that i7(maxjg[„] \ei\) < 2y/\ogn. 
Consequently, combining Proposition 7.2 and (7.33) implies that 


sup 

(u,k)eT 


Ru,k\ < B2d{logn 




p + log{dn) 


n 


(7.36) 


holds with probability at least 1 — 3n“^, 

By (7.32), (7.35) and (7.36), a similar argument to that leading to (7.28) gives, on this 
occasion that for any 7 G (0,1), 


P 


I^MB _ 

I ^ max ^ 0 


> 




< 


7 + n 


-1 


(7.37) 


where 


tJ, = sup 

X [d] 



i=l 


= sup 
/6.F 



^=1 


(7.38) 


for P = as in (7.20). 

Notice that $\Psi_0$ is the supremum of a (conditional) Gaussian process indexed by $\mathcal{F}$ with covariance function $E_e\{(G^e f_{u,k})(G^e f_{v,\ell})\} = n^{-1} \sum_{i=1}^n \psi_k(F^u(u^T X_i))\, \psi_\ell(F^v(v^T X_i))$. Next we use an approximation due to Chernozhukov, Chetverikov and Kato (2014b). Let $x_n = \{x_1, \ldots, x_n\}$ be a realization of the data $\mathcal{X}_n$. Theorem A.2 there shows that for every $\delta > 0$, there exists a subset $G_n$ such that $P(\mathcal{X}_n \in G_n) \ge 1 - 3n^{-1}$ and for every $x_n \in G_n$ one can construct on an enriched probability space a random variable that is equal in distribution to $\|G\|_{\mathcal{F}}$, for $G$ as in Lemma 7.3, and that satisfies


P 




-^■^1 > ^ + 


K log n ^ ^ 1/2 {RP log n)3/4 


n 


Id 


n 


1/4 


A'. 


< B 


i/2(A:^logn)3/4 


Id 




-I- n 


-1 


(7.39) 
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where = p + log d. 

Finally, combining (7.31) with inequalities (7.37) and (7.39), and setting 
7 = i32^^{(j> + logn)/n}^/'^ and (5 = i?J^^(logn)^/®(p/n)^/® 
completes the proof of (7.6) in view of (4.7) and (7.31). □

8 Proof of technical lemmas 

We provide proofs here for all the technical lemmas. Throughout, we use C and c to denote 
universal positive constants, which may take different values at each occurrence. 

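Lemma 8.1. Let $\{\xi_i, i \ge 1\}$ be a sequence of independent random variables with zero means and finite variances. Put $S_n = \sum_{i=1}^n \xi_i$, $V_n^2 = \sum_{i=1}^n \xi_i^2$ and $B_n^2 = \sum_{i=1}^n E\xi_i^2$; then for any $x > 0$,

\[ P\{|S_n| \ge x(V_n + 4B_n)\} \le 4\exp(-x^2/2) \quad \text{and} \tag{8.1} \]

\[ E[S_n^2 I\{|S_n| \ge x(V_n + 4B_n)\}] \le 23 B_n^2 \exp(-x^2/4). \tag{8.2} \]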

Proof of Lemma 8.1. The proof is based on Theorem 2.16 in de la Pena, Lai and Shao 
(2009) and Lemma 3.2 in Lai, Shao and Wang (2011). □ 
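
To make the self-normalized tail bound (8.1) concrete, the sketch below estimates $P\{|S_n| \ge x(V_n + 4B_n)\}$ by simulation and compares it with $4\exp(-x^2/2)$. The choice of Uniform$(-1,1)$ summands (mean zero, variance $1/3$), the sample sizes and all function names are assumptions made purely for illustration.

```python
import math
import random


def tail_frequency(n: int, x: float, reps: int, seed: int = 0) -> float:
    """Empirical frequency of {|S_n| >= x (V_n + 4 B_n)} for Uniform(-1, 1) summands."""
    rng = random.Random(seed)
    # Each xi_i has mean zero and variance 1/3, so B_n^2 = n / 3.
    b_n = math.sqrt(n / 3.0)
    hits = 0
    for _ in range(reps):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        s_n = sum(xs)
        v_n = math.sqrt(sum(v * v for v in xs))
        if abs(s_n) >= x * (v_n + 4.0 * b_n):
            hits += 1
    return hits / reps


def bound_8_1(x: float) -> float:
    """Right-hand side of (8.1), capped at 1 because it bounds a probability."""
    return min(1.0, 4.0 * math.exp(-x * x / 2.0))


if __name__ == "__main__":
    for x in (0.1, 0.3, 1.0, 2.0):
        print(x, tail_frequency(50, x, reps=2000), bound_8_1(x))
```

Because $V_n + 4B_n$ is roughly five standard deviations of $S_n$ here, the empirical frequency drops to zero well before the bound becomes nontrivial; the inequality is conservative, and its value lies in holding without any moment assumption beyond finite variances.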

Lemma 8.2. Assume that the conditions of Proposition 7.1 hold. Then for every $k \ge 1$ and $t > 0$,

\[ P\{\sqrt{nm}\, |R_{2k}| \ge C_1 \|\psi_k'\|_\infty\, t\} \le C_2 \exp(-t/4), \tag{8.3} \]

where $C_1, C_2 > 0$ are absolute constants.


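Proof of Lemma 8.2. Without loss of generality we only prove the result for $t \ge 4$; otherwise we can simply adjust the constant $C_2$ so that $C_2 \exp(-t/4) \ge 1$ for $0 < t < 4$. For given $k \ge 1$, define $Q_i = \sum_{j=1}^m q_{ij}$ with $q_{ij} = q_{ij,k} = h_{0k}(X_i, Y_j)$ for $h_{0k}$ as in (7.10). Put $\mathcal{F}_Y = \sigma\{Y_1, \ldots, Y_m\}$, such that given $\mathcal{F}_Y$, $\{Q_i\}_{i=1}^n$ forms a sequence of independent random variables with zero (conditional) means. Noting that $\sum_{i=1}^n \sum_{j=1}^m q_{ij} = \sum_{i=1}^n Q_i$, it follows from a conditional version of (8.1) that for any $t \ge 4$,

\[ P\Big\{ \Big| \sum_{i=1}^n \sum_{j=1}^m q_{ij} \Big| \ge t \Big[ \Big( \sum_{i=1}^n Q_i^2 \Big)^{1/2} + 4 \Big( \sum_{i=1}^n E(Q_i^2 \mid \mathcal{F}_Y) \Big)^{1/2} \Big] \,\Big|\, \mathcal{F}_Y \Big\} \le 4\exp(-t^2/2). \tag{8.4} \]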


We study the tail behaviors of $\sum_{i=1}^n Q_i^2$ and $\sum_{i=1}^n E(Q_i^2 \mid \mathcal{F}_Y)$ separately, starting with $\sum_{i=1}^n Q_i^2$. Observe that given $X_i$, $Q_i$ is a sum of independent random variables with zero means. Put $V_i^2 = \sum_{j=1}^m q_{ij}^2$ and $B_i^2 = \sum_{j=1}^m E(q_{ij}^2 \mid X_i)$. A direct consequence of (8.2) is that for every $t > 0$, $E[Q_i^2 I\{|Q_i| \ge t(V_i + 4B_i)\} \mid X_i] \le 23 B_i^2 \exp(-t^2/4)$. This implies by taking expectations on both sides that

\[ E[Q_i^2 I\{|Q_i| \ge t(V_i + 4B_i)\}] \le 23 E(B_i^2) \exp(-t^2/4), \tag{8.5} \]

where $E(B_i^2) = \sum_{j=1}^m E(q_{ij}^2) \le m E\{h_k(X,Y)^2\} = m\sigma_k^2$ for $\sigma_k^2$ as in (7.9). Together, (8.5) and Lemma 7.2 in Shao and Zhou (2014) imply, for $t \ge 4$,

< 92t“"^exp(—1^/4) < (1/2) exp(—1^/4). (8.6) 
We consider next $E(Q_i^2 \mid \mathcal{F}_Y)$, which can be decomposed as


P 








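\[ \begin{aligned}
E(Q_i^2 \mid \mathcal{F}_Y)
&= E[Q_i^2 I\{|Q_i| < t(V_i + 4B_i)\} \mid \mathcal{F}_Y] + E[Q_i^2 I\{|Q_i| \ge t(V_i + 4B_i)\} \mid \mathcal{F}_Y] \\
&\le t^2 E[(V_i + 4B_i)^2 \mid \mathcal{F}_Y] + E[Q_i^2 I\{|Q_i| \ge t(V_i + 4B_i)\} \mid \mathcal{F}_Y] \\
&\le 17 t^2 m\sigma_k^2 + 17 t^2 \sum_{j=1}^m E(q_{ij}^2 \mid Y_j) + E[Q_i^2 I\{|Q_i| \ge t(V_i + 4B_i)\} \mid \mathcal{F}_Y].
\end{aligned} \]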


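Hence, it follows from Markov's inequality and (8.5) that

\[ \begin{aligned}
& P\Big\{ \sum_{i=1}^n E(Q_i^2 \mid \mathcal{F}_Y) \ge 18 t^2 nm\sigma_k^2 + 17 t^2 \sum_{i=1}^n \sum_{j=1}^m E(q_{ij}^2 \mid Y_j) \Big\} \\
& \quad \le P\Big\{ \sum_{i=1}^n E[Q_i^2 I\{|Q_i| \ge t(V_i + 4B_i)\} \mid \mathcal{F}_Y] \ge t^2 nm\sigma_k^2 \Big\} \\
& \quad \le t^{-2} (nm\sigma_k^2)^{-1} \sum_{i=1}^n E[Q_i^2 I\{Q_i^2 \ge t^2 (V_i + 4B_i)^2\}] \\
& \quad \le (3/2) \exp(-t^2/4).
\end{aligned} \tag{8.7} \]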


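By (7.9), we have $\|h_{0k}\|_\infty \le 2b_k$. Then combining (8.4), (8.6) and (8.7) gives, for $t \ge 4$,

\[ P\Big\{ \frac{1}{\sqrt{nm}} \Big| \sum_{i=1}^n \sum_{j=1}^m q_{ij} \Big| \ge C_1 (\sigma_k + b_k)\, t \Big\} \le 6\exp(-t/4). \]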


This completes the proof of Lemma 8.2. □ 

Lemma 8.3. Assume that the conditions of Theorem 4.2 are fulfilled. Then for all sufficiently large $n$,

max|^?fe - 6lfc| < (8.8) 

k&[d] n 

where t)k '■= E^d{'fik(y)} with V = F{Y). 
Proof of Lemma 8.3. Under the alternative, the density of $U = F(Y)$ is of the form $p_\theta(z) = C_d(\theta) \exp\{\theta^T \psi(z)\}$, where $\{C_d(\theta)\}^{-1} = \int_0^1 \exp\{\theta^T \psi(z)\}\, dz$ and $\psi = (\psi_1, \ldots, \psi_d)^T$.

In this notation, we have $\vartheta_k = C_d(\theta) \int_0^1 \psi_k(z) \exp\{\theta^T \psi(z)\}\, dz$. Note that


\9'^'if{z)\ 


d 


'^9k^pk{z) 

k=l 


< Bodd'^ max \9k\ 

k&[d\ 


A-Bod dL 



Consequently, applying the inequality $|e^t - 1 - t| \le (t^2/2) \exp(t \vee 0)$, which holds for every $t \in \mathbb{R}$, to $t = \theta^T \psi(z)$ yields

[ fjk{z)exp{9'^'il){z)}dz= [ ipk{z){l + 9 '^il 7 {z)}dz + 0 {l) [ \'tfk{z)\{9'^ii’iz)}'^ dz 
Jo Jo Jo 


= '^9e 'ilJk{z)'ip£{z)dz + 0{l)Bg^d^'" 


logd 


e=i 


n 


= 9k + 0 {l)Bt,dd 


2 , 2 rlogd 


n 


uniformly over $k \in [d]$. Similarly, it can be proved that


exp{d''' 0 ( 2 :)} dz — 1 


< r 2 ,2rlogd 


which implies $C_d(\theta) = 1 + o(1)$ as $d, n \to \infty$. Combining the above calculations proves (8.8).

□ 
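
The content of Lemma 8.3 is that, for a small tilt $\theta$, the mean $\vartheta_k = E\{\psi_k(U)\}$ under the density $p_\theta(z) = C_d(\theta)\exp\{\theta^T\psi(z)\}$ is close to $\theta_k$. The sketch below illustrates this numerically. It assumes the orthonormal cosine basis $\psi_k(z) = \sqrt{2}\cos(\pi k z)$ as a stand-in for the paper's basis, a hypothetical $\theta$ with $d = 3$, and a simple midpoint quadrature; none of these choices comes from the paper.

```python
import math


def psi(k: int, z: float) -> float:
    """Orthonormal cosine basis on [0, 1]; an assumed stand-in for the paper's psi_k."""
    return math.sqrt(2.0) * math.cos(math.pi * k * z)


def integrate(f, steps: int = 20000) -> float:
    """Midpoint rule on [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((j + 0.5) * h) for j in range(steps))


def tilted_means(theta):
    """Return [vartheta_1, ..., vartheta_d], where vartheta_k = E psi_k(U) and U ~ p_theta."""
    d = len(theta)

    def weight(z: float) -> float:
        return math.exp(sum(theta[k] * psi(k + 1, z) for k in range(d)))

    norm = integrate(weight)  # equals 1 / C_d(theta)
    return [integrate(lambda z, k=k: psi(k + 1, z) * weight(z)) / norm for k in range(d)]


if __name__ == "__main__":
    theta = [0.05, -0.03, 0.02]  # hypothetical small tilt
    for t, v in zip(theta, tilted_means(theta)):
        print(f"theta_k={t:+.4f}  vartheta_k={v:+.4f}  diff={v - t:+.5f}")
```

The differences are of second order in $\theta$, which matches the quadratic remainder produced by the Taylor-type inequality used in the proof.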


Lemma 8.4. Under the null hypothesis $H_0 : F = G$, the class $\mathcal{H}_0$ of degenerate kernels $\mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, to which an envelope $2B_{2d}$ is attached, is VC-type; that is, there are constants $A > 2e$ and $v > 2$ such that

\[ \sup_{Q\ \text{discrete}} N(\mathcal{H}_0, L_2(Q), 2\varepsilon B_{2d}) \le d \cdot (A/\varepsilon)^{vp}, \tag{8.9} \]

where the supremum ranges over all finitely discrete Borel probability measures on $\mathbb{R}^p \times \mathbb{R}^p$.

Proof of Lemma 8.4. First we prove that the class $\mathcal{H}$ of kernels is VC-type. Note that $\mathcal{H}$ has envelope $B_{2d}$ and admits the partition $\mathcal{H} = \cup_{k=1}^d \mathcal{H}_k$, where for each $k \in [d]$, the class $\mathcal{H}_k = \{h_{u,k} \in \mathcal{H} : u \in S^{p-1}\}$ has an envelope $\le B_{2d}$. This implies

\[ \sup_{Q\ \text{discrete}} N(\mathcal{H}, L_2(Q), \varepsilon B_{2d}) \le \sum_{k=1}^d \sup_{Q\ \text{discrete}} N(\mathcal{H}_k, L_2(Q), \varepsilon B_{2d}), \tag{8.10} \]

where the supremum ranges over all finitely discrete Borel probability measures on $\mathcal{S}_2 := \mathbb{R}^p \times \mathbb{R}^p$. Therefore, it suffices to restrict attention to the class $\mathcal{H}_k$ with a fixed $k \in [d]$. For every $u \in S^{p-1}$, observe that $h_{u,k}(x, y) = \psi_k' \circ f_u(y) \cdot I_u(x, y)$. Regarding each element of $\psi_k'(\mathcal{F}^p) := \{y \mapsto \psi_k' \circ f_u(y) : f_u \in \mathcal{F}^p\}$ as a measurable function on $\mathcal{S}_2$, i.e. $(x, y) \mapsto \psi_k' \circ f_u(y)$, we have $\mathcal{H}_k \subset \psi_k'(\mathcal{F}^p) \cdot \mathcal{I}^p$ for $\mathcal{F}^p$ and $\mathcal{I}^p$ given in (7.21). Since both the classes $\mathcal{F}^p$ and $\mathcal{I}^p$ have envelope 1 and the function $\psi_k'$ is Lipschitz continuous, it follows from Lemma A.6 and Corollary A.1 in Chernozhukov, Chetverikov and Kato (2014a) that, for any $0 < \varepsilon < 1$,

\[ \sup_{Q\ \text{discrete}} N(\psi_k'(\mathcal{F}^p), L_2(Q), \varepsilon B_{2d}) \le \sup_{Q\ \text{discrete}} N(\mathcal{F}^p, L_2(Q), \varepsilon) \tag{8.11} \]

and

\[ \sup_{Q\ \text{discrete}} N(\mathcal{H}_k, L_2(Q), 2\varepsilon B_{2d}) \le \sup_{Q\ \text{discrete}} N(\psi_k'(\mathcal{F}^p), L_2(Q), \varepsilon B_{2d}) \cdot \sup_{Q\ \text{discrete}} N(\mathcal{I}^p, L_2(Q), \varepsilon), \tag{8.12} \]

where the suprema above are taken over all finitely discrete Borel probability measures on $\mathcal{S}_2$.

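In view of (8.11) and (8.12), it remains to focus on the classes $\mathcal{F}^p$ and $\mathcal{I}^p$. Arguments similar to those in Sherman (1994) can be used to control the entropies of $\mathcal{I}^p$. To see this, define $\mathcal{V} = \{v(\cdot, \cdot, \cdot; u) : u \in \mathbb{R}^p\}$ and $\mathcal{W} = \{w(\cdot, \cdot, \cdot; \gamma) : \gamma \in \mathbb{R}\}$, where $v(x, y, t; u) = u^T x - u^T y$ and $w(x, y, t; \gamma) = \gamma t$ for $x, y \in \mathbb{R}^p$ and $t \in \mathbb{R}$. Note that $\mathcal{V}$ (resp. $\mathcal{W}$) is a $p$-dimensional (resp. 1-dimensional) vector space of real-valued functions on $\mathcal{S}_2 \times \mathbb{R}$. By Theorem 4.6 in Dudley (2014), the class of sets of the form $\{z : v(z) \ge s\}$ or $\{z : v(z) > s\}$ with $v \in \mathcal{V}$ for some $s \in \mathbb{R}$ fixed is a VC class with index $p + 1$. For every $u \in S^{p-1}$,

\[ \begin{aligned}
\mathrm{graph}(I_u) &= \{(x, y, t) \in \mathcal{S}_2 \times \mathbb{R} : 0 < t < I_u(x, y)\} \\
&= \{u^T x - u^T y \le 0\} \cap \{t > 0\} \cap \{t \ge 1\}^c \\
&= \{v_1 > 0\}^c \cap \{w_1 > 0\} \cap \{w_2 \ge 1\}^c,
\end{aligned} \]

where $v_1 \in \mathcal{V}$ and $w_1, w_2 \in \mathcal{W}$. Together with Lemma 9.7 in Kosorok (2008), this implies that $\{\mathrm{graph}(I_u) : I_u \in \mathcal{I}^p\}$ forms a VC class with index $\le p + 3$. Consequently, by Theorem 9.3 in Kosorok (2008), there exist constants $a > 2e$ and $c > 2$ such that

\[ \sup_Q N(\mathcal{I}^p, L_2(Q), \varepsilon) \le (a/\varepsilon)^{cp} \tag{8.13} \]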

for any $0 < \varepsilon < 1$, where the supremum is taken over all Borel probability measures on $\mathcal{S}_2$. For $\mathcal{F}^p$, applying Lemma A.2 in Ghosal, Sen and van der Vaart (2000) combined with (8.13) gives

\[ \sup_Q N(\mathcal{F}^p, L_2(Q), 2\varepsilon) \le \sup_Q N(\mathcal{I}^p, L_2(P_X \times Q), \varepsilon^2) \le (a/\varepsilon^2)^{cp}, \tag{8.14} \]

where the supremum ranges over all Borel probability measures on 




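Together, (8.10)-(8.14) imply the VC-type property of the class $\mathcal{H}$.

We consider next the class $\mathcal{H}_0$ of degenerate kernels under $H_0$, which admits a partition similar to (8.10), i.e. $\mathcal{H}_0 = \cup_{k=1}^d \mathcal{H}_{0k}$. Observe that for each $(u, k) \in \mathcal{T}$,

\[ h^0_{u,k}(x, y) = h_{u,k}(x, y) + \psi_k \circ f_u(x) + \phi_k \circ f_u(y), \qquad x, y \in \mathbb{R}^p, \]

where $h^0_{u,k} \in \mathcal{H}_{0k} \subset \mathcal{H}_0$, $f_u \in \mathcal{F}^p$ and $\phi_k(s) := -s\psi_k'(s)$ for $0 \le s \le 1$. For any $u, v \in S^{p-1}$ and $k \in [d]$, we have $|\phi_k \circ f_u(y) - \phi_k \circ f_v(y)| \le 2B_{2d} |f_u(y) - f_v(y)|$. This, together with Lemma A.6 in Chernozhukov, Chetverikov and Kato (2014a), yields

\[ \sup_{Q\ \text{discrete}} N(\phi_k(\mathcal{F}^p), L_2(Q), 2\varepsilon B_{2d}) \le \sup_{Q\ \text{discrete}} N(\mathcal{F}^p, L_2(Q), \varepsilon). \tag{8.15} \]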

On combining (8.11), (8.12) and (8.15), and recalling the permanence of the uniform entropy bound under summation that is implied by Lemma A.6 in Chernozhukov, Chetverikov and Kato (2014a), we obtain

sup N{nok,L 2 {Q),^eB 2 d) 

Q discrete 

< sup N{nok,L 2 {Q),sB 2 d) 

Q discrete 

X sup N{MBP),L2{Q),eB2d) sup A^(4(^), L2(Q), ei?2d) 

Q discrete Q discrete 

< sup N{lP,L2{Q),e/2)\ sup N{BP,L2{Q),e/2) 

Q discrete L Q discrete 

This completes the proof of (8.9) in view of (8.13) and (8.14). □ 



8.1 Proof of Lemma 7.1 

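Observe that $\{U_{n,m}(h_0)\}_{h_0 \in \mathcal{H}_0}$ forms a degenerate two-sample $U$-process indexed by $\mathcal{H}_0$ and, by Lemma 8.4, $\mathcal{H}_0$ is VC-type with envelope $2B_{2d}$. The entropy bound given in (8.9) now allows us to apply Lemma 2.4 in Neumeyer (2004), yielding

\[ \begin{aligned}
E(\|U_{n,m}\|_{\mathcal{H}_0})
&\lesssim B_{2d} \sqrt{nm} \Big\{ \frac{1}{4} + \int_0^{1/4} \sup_{Q\ \text{discrete}} \sqrt{\log N(\mathcal{H}_0, L_2(Q), 2\varepsilon B_{2d})}\, d\varepsilon \Big\} \\
&\le B_{2d} \sqrt{nm} \Big\{ \frac{1}{4} + \int_0^{1/4} \sqrt{\log d + vp \log(A/\varepsilon)}\, d\varepsilon \Big\} \\
&\le B_{2d} \sqrt{nm} \Big\{ \frac{1}{4} (1 + \sqrt{\log d}) + \sqrt{vp}\, A \int_{4A}^\infty t^{-2} \sqrt{\log t}\, dt \Big\}.
\end{aligned} \tag{8.16} \]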
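For any $a > e$, it follows from integration by parts that

\[ \begin{aligned}
\int_a^\infty t^{-2} \sqrt{\log t}\, dt
&= a^{-1} \sqrt{\log a} + \frac{1}{2} \int_a^\infty t^{-2} (\log t)^{-1/2}\, dt \\
&\le a^{-1} \sqrt{\log a} + \frac{1}{2 \log a} \int_a^\infty t^{-2} \sqrt{\log t}\, dt
\le a^{-1} \sqrt{\log a} + \frac{1}{2} \int_a^\infty t^{-2} \sqrt{\log t}\, dt,
\end{aligned} \]

so that $\int_a^\infty t^{-2} \sqrt{\log t}\, dt \le 2a^{-1} \sqrt{\log a}$. Substituting this into (8.16) proves (7.25).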


□ 
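
The integration-by-parts argument above amounts to the bound $\int_a^\infty t^{-2}\sqrt{\log t}\,dt \le 2a^{-1}\sqrt{\log a}$ for $a > e$. As a numerical check, the sketch below uses the substitution $s = \log t$, under which the integral becomes $\int_{\log a}^\infty \sqrt{s}\, e^{-s}\, ds = a^{-1}\sqrt{\log a} + (\sqrt{\pi}/2)\,\mathrm{erfc}(\sqrt{\log a})$. The function names and the quadrature settings are illustrative assumptions.

```python
import math


def tail_integral_closed_form(a: float) -> float:
    """Integral of t**-2 * sqrt(log t) over [a, inf) via the incomplete gamma identity."""
    log_a = math.log(a)
    return math.sqrt(log_a) / a + 0.5 * math.sqrt(math.pi) * math.erfc(math.sqrt(log_a))


def tail_integral_numeric(a: float, width: float = 40.0, steps: int = 200_000) -> float:
    """Midpoint rule for the integral of sqrt(s) * exp(-s) over [log a, log a + width]."""
    lo = math.log(a)
    h = width / steps
    total = 0.0
    for j in range(steps):
        s = lo + (j + 0.5) * h
        total += math.sqrt(s) * math.exp(-s)
    return h * total


def claimed_bound(a: float) -> float:
    """The bound 2 * sqrt(log a) / a obtained from the integration by parts."""
    return 2.0 * math.sqrt(math.log(a)) / a


if __name__ == "__main__":
    for a in (math.e, 5.0, 50.0):
        print(a, tail_integral_closed_form(a), tail_integral_numeric(a), claimed_bound(a))
```

The closed form and the quadrature agree to several digits, and both stay below the bound, with the gap narrowing in relative terms as $a$ grows.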


8.2 Proof of Lemma 7.2 

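Define the class $\mathcal{G} = \{x \mapsto g_{u,t}(x) = I\{u^T x \le t\} : (u, t) \in S^{p-1} \times \mathbb{R}\}$ of indicator functions on closed half-spaces in $\mathbb{R}^p$, such that

\[ D_n(\mathcal{G}) := \sup_{(u,t) \in S^{p-1} \times \mathbb{R}} |F_n^u(t) - F^u(t)| = \sup_{g \in \mathcal{G}} \Big| \frac{1}{n} \sum_{i=1}^n \{g(X_i) - P_X g\} \Big|, \]

where $P_X g := E\{g(X)\}$. Note that, for every $(u, t) \in S^{p-1} \times \mathbb{R}$, $\mathrm{Var}\{g_{u,t}(X)\} = F^u(t)\{1 - F^u(t)\} \le 1/4$. A direct consequence of Theorem 7.3 in Bousquet (2003) is that, for every $t > 0$,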

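\[ P\Big( D_n(\mathcal{G}) \ge E\{D_n(\mathcal{G})\} + \sqrt{[\tfrac{1}{2} + 4E\{D_n(\mathcal{G})\}]\, t/n} + \frac{t}{3n} \Big) \le 2e^{-t}. \]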

To control the expectation $E\{D_n(\mathcal{G})\}$, first it follows from Theorem B in Dudley (1979) that the class $\mathcal{G}$ is a VC-subgraph class with index $p + 2$, such that for any probability
measure Q on and any 0 < e < 1, N{G,L 2 {Q),e) < This, together with 

Proposition 3 in Giné and Nickl (2009) gives

\[ E\{D_n(\mathcal{G})\} \lesssim \sqrt{\frac{p}{n}} + \frac{p}{n}. \]

Since $D_n(\mathcal{G}) \le 1$, the last three displays together complete the proof of (7.26). □
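
The quantity $D_n(\mathcal{G})$ is a supremum over all half-spaces, so it cannot be computed exactly by enumeration, but it can be approximated from below by scanning finitely many directions $u$ and taking the one-dimensional Kolmogorov distance along each. The sketch below does this under the assumption that $X \sim N(0, I_p)$, so that every projection $u^T X$ is standard normal and $F^u = \Phi$; the number of random directions and all function names are illustrative.

```python
import math
import random


def std_normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def random_direction(p: int, rng: random.Random):
    """Draw a direction uniformly on the unit sphere S^{p-1}."""
    g = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def ks_distance(values) -> float:
    """sup_t |F_n(t) - Phi(t)| for a one-dimensional sample."""
    xs = sorted(values)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs, start=1):
        c = std_normal_cdf(x)
        worst = max(worst, i / n - c, c - (i - 1) / n)
    return worst


def halfspace_discrepancy(sample, directions) -> float:
    """Lower approximation of D_n(G): maximum KS distance over the given directions."""
    return max(
        ks_distance([sum(ui * xi for ui, xi in zip(u, x)) for x in sample])
        for u in directions
    )


if __name__ == "__main__":
    rng = random.Random(1)
    n, p = 400, 3
    sample = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    directions = [random_direction(p, rng) for _ in range(50)]
    print(halfspace_discrepancy(sample, directions), math.sqrt(p / n))
```

Printing $\sqrt{p/n}$ next to the approximation gives a rough sense of the rate in (7.26); since only finitely many directions are scanned, the computed value can only underestimate $D_n(\mathcal{G})$.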


8.3 Proof of Lemma 7.3 

To prove (7.29), a new coupling inequality for the suprema of empirical processes in Chernozhukov, Chetverikov and Kato (2014a) plays an important role in our analysis. Recall in
the proof of Lemma 8.4 that the collection is VC-type, from which we obtain 

d 

sup N{P,L 2 {Q),eBM) <Y1 sup N{MEn,L 2 {Q),eBu) < d ■ {A/erP (8.17) 

Q discrete discrete 

for some constants A > 2e and u > 2, where P = P^. This implies by Lemma 2.1 
in Chernozhukov, Chetverikov and Kato (2014a) that the collection is a VC-type pre- 
Gaussian class with a constant envelop = Bid- Therefore, there exists a centered, tight 
Gaussian process $G$ defined on $\mathcal{F}$ with covariance function (7.24). Moreover, for any integer $k \ge 2$,

\[ \sup_{f \in \mathcal{F}} P_X |f|^k \le B_{0d}^{k-2} \sup_{(u,k) \in S^{p-1} \times [d]} E\{\psi_k(U^u)^2\} = B_{0d}^{k-2} \tag{8.18} \]

where we used the fact that $U^u = F^u(u^T X) =_d \mathrm{Unif}(0,1)$ and hence $E\{\psi_k(U^u)^2\} = 1$ for all $(u, k) \in S^{p-1} \times [d]$.

The entropy bound (8.17) and the moment inequality (8.18) now allow us to apply Corollary 2.2 in Chernozhukov, Chetverikov and Kato (2014a), yielding (7.29). □
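
The moment inequality (8.18) rests on two facts: $\psi_k$ has unit second moment under Unif$(0,1)$, and $|\psi_k|^m \le B^{m-2}\psi_k^2$ whenever $|\psi_k| \le B$. The sketch below checks both numerically for the orthonormal cosine basis $\psi_k(z) = \sqrt{2}\cos(\pi k z)$, for which the sup-norm bound is $\sqrt{2}$. This basis is an assumed example and need not coincide with the one used in the paper; the function names are illustrative.

```python
import math

SUP_NORM = math.sqrt(2.0)  # sup-norm of the assumed cosine basis functions


def psi(k: int, z: float) -> float:
    """Orthonormal cosine basis on [0, 1]; an assumed stand-in for the paper's psi_k."""
    return math.sqrt(2.0) * math.cos(math.pi * k * z)


def integrate(f, steps: int = 20000) -> float:
    """Midpoint rule on [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((j + 0.5) * h) for j in range(steps))


def abs_moment(k: int, m: int) -> float:
    """E|psi_k(U)|^m for U ~ Unif(0, 1)."""
    return integrate(lambda z: abs(psi(k, z)) ** m)


if __name__ == "__main__":
    for k in (1, 2, 3):
        for m in (2, 3, 4):
            print(k, m, abs_moment(k, m), SUP_NORM ** (m - 2))
```

For $m = 2$ the moment equals 1, and for larger $m$ it stays below $B^{m-2}$, which is the pattern that (8.18) records uniformly over the index set.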
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